Preface


Generalized Method of Moments (GMM) has become one of the main statis- 
tical tools for the analysis of economic and financial data. Accompanying this 
empirical interest, there is a growing literature in econometrics on GMM-based 
inference techniques. In fact, in many ways, GMM is becoming the common 
language of econometric dialogue because the framework subsumes many other 
statistical methods of interest, such as Least Squares, Maximum Likelihood and 
Instrumental Variables. 

This book provides a comprehensive treatment of GMM estimation and 
inference in time series models. Building from the instrumental variables es- 
timator in static linear models, the book presents the asymptotic statistical 
theory of GMM in nonlinear dynamic models. This framework covers classical 
results on estimation, such as consistency and asymptotic normality, and also 
inference techniques, such as the overidentifying restrictions test and tests of 
structural stability. The finite sample performance of these inference methods 
is also reviewed. Additionally, there is detailed discussion of recent develop- 
ments on covariance matrix estimation, the impact of model misspecification, 
moment selection, the use of the bootstrap, and weak instrument asymptotics. 
There is also a brief exploration of the connections between GMM and other 
moment-based estimation methods such as Simulated Method of Moments, In- 
direct Inference and Empirical Likelihood. 

The computer scientist Jan van de Snepscheut once admonished that “in 
theory, there is no difference between theory and practice. But, in practice, 
there is.” Arguably a universal truth, this statement is certainly true about 
econometrics. Therefore, throughout the text, we focus not only on the theo- 
retical arguments but also on issues that arise in implementing the statistical 
methods in practice. All the inference techniques are illustrated using empirical 
examples in macroeconomics and finance. 

The text assumes a knowledge of econometrics, statistics and matrix algebra 
at the level of a course based on a text such as William Greene’s Econometric
Analysis. All the main statistical results are discussed intuitively and proved 
formally. The presentation is designed to be accessible to a first- or second-year 
student in a graduate economics program at an American university. 

This book developed out of lectures given at North Carolina State University. 
Parts of the material were also used as a basis for short courses at: the Division
of Research and Statistics at the Board of Governors of the Federal Reserve
System in Washington D.C.; the Netherlands Graduate School of Economics;
the Mansholt Graduate School of Social Sciences at Wageningen University in
the Netherlands; and the Department of Economics and Management at Wageningen
University. Earlier drafts of the book were used by Eric Ghysels in a graduate
econometrics course taught at Pennsylvania State University. I am very grateful 
to the participants in these courses for many useful comments and suggestions 
that have improved the book. 

I made considerable progress in translating these lecture notes into the chap- 
ters of this book during my tenure of a research fellowship at the Department 
of Economics at the University of Birmingham. I am indebted to this department
for both this support and the collegial atmosphere that made my
visit both productive and pleasurable. I also worked on the book while a short- 
term visitor at the Department of Economics and Management at Wageningen 
University and gratefully acknowledge this support. The rest of the work was 
undertaken at the Department of Economics at North Carolina State University, 
and I am happy to have this opportunity to record my gratitude to the department
and university for their support over the years of both my own work and also 
econometrics more generally. 

In the course of preparing the manuscript, a number of questions arose for 
which I had to turn to others for help. I would like to record my sincere grat- 
itude to the following for generously sharing their time in order to provide me 
with the answers: John Aldrich, Anil Bera, Ron Gallant, Eric Ghysels, Atsushi 
Inoue, Essie Maasoumi, Louis Maccini, Angelo Melino, Benedikt Pötscher, Bob 
Rossana, Steve Satchell, Wally Thurman, Ken West, Ken Vetzal, and Tim 
Vogelsang. A number of people have read various drafts of this work and pro- 
vided comments. This feedback was invaluable and I wish to thank particularly 
Ron Gallant, Eric Ghysels, Sanggohn Han, Atsushi Inoue, Kalidas Jana, Alan 
Ker, Kostas Kyriakoulis, Fernanda Peixe, Barbara Rossi, Amit Sen and Aris 
Spanos. 

This book took far longer to complete than I ever imagined at the outset of 
the project. Over the years, I have accumulated a considerable debt of grati- 
tude to: Lee Craig, who provided sagacious advice on various aspects of book 
authorship and literary style; Andrew Schuller, the editor, who provided continual
encouragement; and Jason Pearce, who patiently answered my questions
about TeX. I have pleasure in thanking all three for their help.

However, my greatest debt is to my family. My wife Ada provided unfailing 
support throughout, and I dedicate this book to her and our son, Marten, as a 
token of my heartfelt gratitude. 


Raleigh, NC 
Introduction


1.1 Generalized Method of Moments in 
Econometrics 


Generalized Method of Moments (GMM) was first introduced into the econo- 
metrics literature by Lars Hansen in 1982. Since then it has been widely applied 
to analyze economic and financial data. This interest has both stimulated and 
been facilitated by the development of numerous statistical inference techniques 
based on GMM estimators. These applications have been in very diverse areas
spanning macroeconomics, finance, agricultural economics, environmental
economics and labour economics. Depending on the context, GMM has been
applied to time series, cross sectional, and panel data. In this book we focus
on the use of GMM estimation with time series data and illustrate the various
inference procedures using examples from macroeconomics and finance.¹ These
areas are arguably the ones in which GMM has been most widely applied and, 
consequently, has had the biggest impact. Table 1.1 gives a list of various areas 
of economics to which GMM has been applied; inevitably this list is not exhaus- 
tive. Many of the studies have been published in top economic journals, which 
is one measure of the importance of the technique. Nearly all the studies have 
been published since the early 1990s and this testifies to the increasing impact 
of GMM on empirical analysis in economics. 

It is natural to wonder why Hansen’s 1982 paper had such an impact. After 
all, Maximum Likelihood estimation (MLE) has been around since the early part 
of the twentieth century and it is the best available estimator within the Classical
statistics paradigm. The optimality of MLE stems from its basis on the joint 
probability distribution of the data, which in this context becomes known as 
the likelihood function. However, in some circumstances, this dependence on 
the probability distribution can become a weakness. In the models in Table 1.1,
two particular problems are present and these have motivated the use of GMM.
These are as follows.


1 For discussions of GMM with panel data, see Baltagi (2001) or Wooldridge (2002).


1. Sensitivity of statistical properties to the distributional assumption 
The desirable statistical properties of MLE are only attained if the dis- 
tribution is correctly specified. Unfortunately, economic theory rarely 
provides the complete specification of the probability distribution of the 
data. One solution is to choose a distribution arbitrarily. However, unless 
this guess coincides with the truth, the resulting estimator is no longer 
optimal and, worse still, its use may lead to biased inferences. 


2. Computational burden 

For many of the models in Table 1.1, Maximum Likelihood estimation 
would be computationally very burdensome. Two types of problem tend 
to occur. In some cases, the economic model coincides with the joint 
probability distribution of the data but the implied likelihood function is 
extremely difficult to evaluate numerically with available computer tech- 
nology. In other cases, the economic model only involves some aspects of 
the probability distribution and the completion of the specification intro- 
duces many additional parameters which must also be estimated. Often 
in these latter cases, the likelihood function must be maximized subject 
to a set of nonlinear constraints implied by the economic model, which 
further adds to the computational burden. 


In contrast, the GMM framework provides a computationally convenient method 
of performing inference in these models without the need to specify the likelihood 
function. 

The cornerstone of GMM estimation is a set of population moment condi- 
tions which are deduced from the assumptions of the econometric model. The 
exact nature of these conditions varies from application to application but, what- 
ever they are, their validity is crucial for the properties of the resulting estima- 
tor. The potential of moment conditions for estimation has been recognized 
since the 1890s when a technique known as Method of Moments was first pro- 
posed. In fact, many estimation techniques familiar in econometrics are based 
either explicitly or implicitly on the information in population moment condi- 
tions. However, prior to Hansen’s work, the statistical theory of these estimators 
tended to be restricted to the moment conditions of a particular functional form. 
One of the main contributions of Hansen’s paper was to emphasize the common 
underlying structure of these previous analyses and to develop a statistical the- 
ory which can be applied to any set of moment conditions. Inevitably, GMM 
builds on these earlier analyses and so to help put GMM in perspective, it is 
useful to understand its statistical antecedents. Therefore, we start by briefly 
summarizing in Section 1.2 how the use of moment conditions has evolved in 
statistics and econometrics. This provides a first illustration of how moment 
conditions can be used as a basis for estimation. It also links GMM to a num- 
ber of estimators familiar in econometrics. After this historical review, a set 
of contemporary examples from Table 1.1 are provided in Section 1.3. At this 
stage, the focus is on showing how the population moment conditions arise in
these models. Later in the book, we return to these models to illustrate
the various estimation and inference procedures discussed. The development of
these procedures requires certain statistical concepts and results. Section 1.4
provides a review of some background statistical theory which is needed for the
introduction of the basic GMM framework in Chapters 2 and 3. More advanced
statistical theory is developed as necessary in subsequent chapters. Section 1.5
concludes the chapter with an overview of the remainder of the book.


Table 1.1
Applications of GMM

Agriculture
    Thijssen (1996), Chavas and Thomas (1999), Bourgeon and Le Roux (2001)

Business cycles
    Singleton (1988), Christiano and Eichenbaum (1992), Burnside, Eichenbaum,
    and Rebelo (1993), Braun (1994), Boldrin, Christiano, and Fisher (2001)

Commodity markets
    Deaton and Laroque (1992), Bjornson and Carter (1997), Considine and Heo
    (2000), Haile (2001)

Consumption
    Miron (1986), English, Miron, and Wilcox (1989), Campbell and Mankiw
    (1990), Runkle (1991), Blundell, Pashardes, and Weber (1993), Blundell,
    Browning, and Meghir (1994), Attanasio and Browning (1995), Attanasio and
    Weber (1995), Ni (1995), Meghir and Weber (1996), Dynan (2000), Fuhrer
    (2000), Weber (2000)

Cost/Production frontiers/functions
    Kopp and Mullahy (1990), Blundell and Bond (2000), Ahn, Good, and Sickles
    (2000)

Development
    Jalan and Ravallion (1999), Hansen and Tarp (2001), Ogaki and Zhang (2001)

Economic growth
    Caselli, Esquivel, and Lefort (1996)

Education/human capital
    Angrist and Krueger (1992), Palacios-Huerta (2003)

Environmental economics
    Smith and Pattanayak (2002)

Equity pricing
    Hansen and Singleton (1982), Singleton (1985), Finn, Hoffman, and
    Schlagenhauf (1990), Ghysels and Hall (1990a,b), Ferson (1990), Bodurtha
    and Mark (1991), Epstein and Zin (1991), Ferson and Constantinides (1991),
    Harvey (1991), MacKinlay and Richardson (1991), Snow (1991), Bessembinder
    and Chan (1992), Ferson and Harvey (1992), Ilmanen (1992), Marshall (1992),
    Bansal, Hsieh, and Viswanathan (1993), Bansal and Viswanathan (1993),
    Cecchetti, Lam, and Mark (1993), Ferson, Foerster, and Keim (1993), Fisher
    (1994), Zhou (1994), Campbell (1996), Cochrane (1996), Hansen and Singleton
    (1996), He, Kan, Ng, and Zhang (1996), Ho, Perraudin, and Sørensen (1996),
    Hagiwara and Herce (1997), Hansen and Jagannathan (1997), Ghysels (1998),
    Garcia and Bonomo (2001), Timmermann (2001), Jiang and Knight (2002),
    Vissing-Jørgensen and Attanasio (2003)

Exchange rates
    Hansen and Hodrick (1980), Mark (1985), Melino and Turnbull (1990),
    Modjtahedi (1991), Bekaert and Hodrick (1992), Cumby and Huizinga (1992),
    Backus, Gregory, and Telmer (1993), Imrohoroglu (1994), Dumas and Solnik
    (1995), Hartmann (1999), Bekaert and Hodrick (2001), Groen and Kleibergen
    (2003)

Health care
    Windmeijer and Silva (1997), Schellhorn (2001), Silva and Windmeijer (2001)

Import demand
    de la Croix and Urbain (1998)

Interest rates
    Dunn and Singleton (1986), Diba and Oh (1991), Lee (1991), Chan, Karolyi,
    Longstaff, and Sanders (1992), Longstaff and Schwartz (1991), Cushing and
    Ackert (1994), Vetzal (1997), Green and Ødegaard (1997)

Inventories
    Miron and Zeldes (1988), Eichenbaum (1989), Kashyap and Wilcox (1993),
    Durlauf and Maccini (1995), Fuhrer, Moore, and Schuh (1995a), Bils and Kahn
    (2000)

Investment
    Gordon (1992), Hubbard and Kashyap (1992), Whited (1992), Bond and Meghir
    (1994), Gilchrist and Himmelberg (1995), Oliner, Rudebusch, and Sichel
    (1996), Chirinko and Schaller (1996), Ogawa and Suzuki (1998), Chirinko and
    Schaller (2001)

Labour demand
    Pindyck and Rotemberg (1983), Arellano and Bond (1991), Pfann and Palm
    (1993)

Labour market
    Yashiv (2000), Yuan and Li (2000)

Labour supply
    Mankiw, Rotemberg, and Summers (1985), Eichenbaum, Hansen, and Singleton
    (1988), Kahn and Lang (1991), Angrist (2001)

Macroeconomic forecasts
    Keane and Runkle (1990), Bonham and Cohen (1995, 2001)

Microstructures in finance
    Madhavan and Smidt (1993), Huang and Stoll (1997), Madhavan, Richardson,
    and Roomans (1997), Biais, Hillion, and Spatt (1999), Grammig and Wellner
    (2002)

Money
    Eckstein and Leiderman (1992), Dutkowsky (1993), Holman (1998), Clarida,
    Gali, and Gertler (2000)

Mutual fund performance
    Chen and Knez (1996), Bekaert and Urias (1996)

Product demand
    Berry, Levinsohn, and Pakes (1995)

Productivity
    Bernstein (1994), Atkinson, Cornwell, and Honerkamp (2003)

R & D spending
    Himmelberg and Petersen (1994)

Resources
    Young (1991, 1992), Green and Mork (1991), Popp (2001)

Technological innovation
    Blundell, Griffith, and Van Reenen (1995)

Trading volume of financial assets
    Foster and Viswanathan (1993), Bessembinder, Chan, and Seguin (1996)

Transportation
    Nevo (2003)


1.2 Population Moment Conditions and the 
Statistical Antecedents of GMM 


The term population moment was originally used in statistics to denote the
expectation of the polynomial powers of a random variable. So if $v_t$ is a discrete
random variable with probability mass function $P(v_t = v)$ defined on a sample
space $\mathcal{V}$ then its $r$th population moment is given by

\[ E[v_t^r] = \sum_{v \in \mathcal{V}} v^r P(v_t = v) = \nu_r \]

where the summation is over all values in $\mathcal{V}$ and $r$ is a positive integer.
If $v_t$ is a continuous random variable with probability density function $p(v)$
then its $r$th moment is given by

\[ E[v_t^r] = \int_{-\infty}^{\infty} v^r p(v)\,dv = \nu_r \]

From these definitions it is easily recognized that the population mean is just
the first population moment and the population variance is $\nu_2 - \nu_1^2$. The term
(population) moment has been in the statistical lexicon since at least the work
of A. Quetelet who lived from 1796 to 1874 and was inspired by the concept of
moments in physics; see Stuart and Ord (1987, p.53).²
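The discrete-case definition can be checked numerically. The following is a minimal sketch, assuming a hypothetical example (a fair six-sided die) that is not taken from the text; the function name `population_moment` is likewise illustrative. It computes the first two population moments exactly and confirms that the variance equals the second moment minus the square of the first.

```python
from fractions import Fraction


def population_moment(pmf, r):
    """Return the r-th population moment: the sum over v of v**r * P(v_t = v)."""
    return sum(Fraction(v) ** r * prob for v, prob in pmf.items())


# Hypothetical example: a fair six-sided die (an assumption for illustration).
die = {v: Fraction(1, 6) for v in range(1, 7)}

nu1 = population_moment(die, 1)   # population mean
nu2 = population_moment(die, 2)   # second population moment
variance = nu2 - nu1 ** 2         # population variance

print(nu1, nu2, variance)  # 7/2 91/6 35/12
```

Exact rational arithmetic is used here only so that the identity holds without rounding error.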

Karl Pearson³ (1893, 1894, 1895) was the first person to recognize the potential
of population moments as a basis for estimation. In this series of articles,
he introduced Method of Moments estimation. To understand his original mo- 
tivation, it is necessary to consider briefly the state of statistical analysis in the 
late nineteenth century. During that century, a lot of natural phenomena were 
thought to be well summarized by a normal distribution. This belief can be at- 
tributed to at least two reasons. First, the actual evidence was limited, because 
only a few data sets had been collected. Secondly, the available diagnostic tests 
were very rudimentary and could only detect very dramatic departures from 
normality; see Stigler (1986, p.330). However, as interest in statistics, and in
science, grew, more datasets were collected. With this growing body of empirical
evidence, researchers became aware that many natural phenomena showed
departures from normality and in particular exhibited skewness. This raised the
challenge of finding theoretical probability distributions which could adequately
capture this behaviour. Karl Pearson was in the forefront of this research and
developed what has become known as the Pearson family of distributions, e.g.
see Stuart and Ord (1987, pp.210-20). This family is characterized by a probability
density function which is indexed by a vector of four parameters. Different
values of the parameters can yield a wide variety of distributions, including the
normal, beta and gamma.


2 Adolphe Quetelet was a Belgian with far-ranging interests. He wrote the libretto of an
opera, a historical survey of romance and poetry as well as his scientific work in astronomy,
sociology and statistics. Pearson (1895) described him as a man “who often foreshadowed
statistical advances without providing the method by which they might be dealt with” (Pearson,
1895, p.381). For an interesting discussion of Quetelet’s contributions see Stigler (1986).

3 Karl Pearson (1857-1936) was an Englishman trained as a mathematician whose interests
also included physics, German history, folklore and philosophy. Apart from Method of
Moments, his numerous contributions to statistics included chi-squared goodness of fit tests,
correlation and the Pearson family of distributions.


The practical problem was to find the most appropriate member of this 
family for the data set in hand — or in other words, to estimate the parameter 
vector. The existing techniques for fitting normal distributions were not suited 
to these more general types of distribution. Instead, Pearson suggested calcu- 
lating estimates based on moments. The idea is simple. Population moments 
implied by the family of distributions are functions of the unknown parameter 
vector. Pearson proposed estimating the parameter vector by the value implied 
by the corresponding sample moments. His approach is best understood by 
considering a simple example. For the purposes of our discussion we can ab- 
stract from the generality of the Pearson family and just focus attention on a 
particular member, the normal distribution. This distribution depends on just 
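two parameters:⁴ the population mean, $\mu_0$, and the population variance, $\sigma_0^2$.
These two parameters satisfy the population moment conditions

\[
\begin{aligned}
E[v_t] - \mu_0 &= 0 \\
E[v_t^2] - (\sigma_0^2 + \mu_0^2) &= 0
\end{aligned}
\tag{1.1}
\]

Pearson’s method involves estimating $(\mu_0, \sigma_0^2)$ by the values $(\hat{\mu}_T, \hat{\sigma}_T^2)$ which
satisfy the analogous sample moment conditions and we have indexed the estimators
by the sample size $T$. Therefore $(\hat{\mu}_T, \hat{\sigma}_T^2)$ are the solutions to

\[
\begin{aligned}
T^{-1} \sum_{t=1}^{T} v_t - \hat{\mu}_T &= 0 \\
T^{-1} \sum_{t=1}^{T} v_t^2 - (\hat{\sigma}_T^2 + \hat{\mu}_T^2) &= 0
\end{aligned}
\]

and so, with some rearrangement, it follows that

\[
\begin{aligned}
\hat{\mu}_T &= T^{-1} \sum_{t=1}^{T} v_t \\
\hat{\sigma}_T^2 &= T^{-1} \sum_{t=1}^{T} (v_t - \hat{\mu}_T)^2
\end{aligned}
\tag{1.2}
\]


4 The normal distribution is obtained from the generic form of the Pearson family by
setting two of the four parameters to zero.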


Pearson called this approach the “Method of Moments” for obvious reasons. 
Pearson (1895) demonstrated the power of this technique with an analysis of 
the distributions of such diverse phenomena as barometric pressures, the sizes 
of the carapace of crabs, the heights of recruits to the U.S. army, the valuation 
of house prices and the number of divorces granted. 
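The Method of Moments calculation for the normal distribution can be sketched in a few lines. This is an illustration under stated assumptions rather than Pearson's own procedure: the data are simulated, the true values (mean 2, variance 9) and the name `method_of_moments_normal` are hypothetical, and only the two moment conditions in (1.1) are used.

```python
import random


def method_of_moments_normal(sample):
    """Solve the sample analogs of (1.1) for (mu, sigma^2)."""
    T = len(sample)
    m1 = sum(sample) / T                  # first sample moment
    m2 = sum(v * v for v in sample) / T   # second sample moment
    mu_hat = m1                           # from m1 - mu = 0
    sigma2_hat = m2 - m1 ** 2             # from m2 - (sigma^2 + mu^2) = 0
    return mu_hat, sigma2_hat


# Simulated data: the true values 2.0 and 3.0**2 are assumptions for the demo.
random.seed(0)
sample = [random.gauss(2.0, 3.0) for _ in range(10_000)]
mu_hat, sigma2_hat = method_of_moments_normal(sample)

# The rearranged form in (1.2) gives the same variance estimate.
sigma2_check = sum((v - mu_hat) ** 2 for v in sample) / len(sample)
print(mu_hat, sigma2_hat, sigma2_check)
```

The last line computes the variance estimate in the form given in (1.2), so the two printed variance values should agree up to floating-point rounding.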

This approach is very intuitive but not without its weaknesses. For example,
all the higher moments of the normal distribution depend on $(\mu_0, \sigma_0^2)$; e.g.
see Stuart and Ord (1987, p.78). Therefore, this technique could have been
applied equally well to the third and fourth moments, say, of the distribution.
The problem is that the resulting estimators of $(\mu_0, \sigma_0^2)$ would be different from
those given in (1.2). Which estimators should be used? This question is hard
to address within the Method of Moments framework. In fact, it was this question
which led R. A. Fisher⁵ to analyze how information from a probability
distribution can be channeled most effectively into parameter estimation. The
result was the Maximum Likelihood principle; see Fisher (1912, 1922, 1925).
In fact, MLE can also be interpreted as a special case of GMM based on a
population moment condition whose derivation requires the specification of the
probability distribution of the data. However, it is pedagogically most convenient
to postpone further discussion of this interpretation until the complete
GMM framework has been introduced in Chapter 3.⁶ For our purposes here, it
is more relevant to consider another weakness inherent in the Method of Moments
framework. Suppose that it is desired to base estimation of $(\mu_0, \sigma_0^2)$ on
the first three moments of $v_t$, that is (1.1) plus

\[ E[v_t^3] - 3E[v_t^2]\mu_0 + 3E[v_t]\mu_0^2 - \mu_0^3 = 0 \tag{1.3} \]


In this case, the sample analogs to (1.1)-(1.3) form a system of three equations
in two unknowns, and such a system typically has no solution. Therefore, the
Method of Moments is infeasible. It is easily recognized that this problem is
not specific to this example. Clearly, some modification is needed in order to
produce estimates of $p$ parameters based on more than $p$ population moment
conditions. This brings us to the second important statistical antecedent of
GMM, namely the method of Minimum Chi-Square.


5 Ronald Fisher (1890-1962) was an English scientist who made fundamental contributions
to statistics, probability, genetics and the design of experiments. He is regarded by many as
the founder of mathematical statistics. Apart from Maximum Likelihood, he developed the
general framework of estimation theory including the concepts of consistency, information,
sufficiency, efficiency, ancillarity and pivotal statistics. His other famous contributions include
the analysis of variance method and the F-distribution.

6 For completeness, we note that if it is assumed in our simple example that $\{v_t; t =
1, 2, \ldots, T\}$ are also independently distributed then $(\hat{\mu}_T, \hat{\sigma}_T^2)$ are the MLEs; e.g. see Stuart
and Ord (1987, p.287). However, this coincidence is the exception rather than the rule. In
general, ML estimation does not involve matching this type of simple population moment
condition; see Section 3.6 for further discussion.

In a series of articles in the late 1920s and the 1930s, Neyman and Pearson
laid the foundations for the framework of “classical” hypothesis testing.⁷ One
side product of this research was the Minimum Chi-Square method of estimation.
The method was originally proposed to facilitate inference about whether
or not an observed sample was generated from a particular distribution, but the
basic idea can be applied to estimation in a wide variety of problems including
the estimation of $(\mu_0, \sigma_0^2)$ based on (1.1)-(1.3). However, it is instructive
to introduce the method in the context of the specific example considered by
Neyman and Pearson.

Neyman and Pearson (1928) considered the particular case in which a researcher
wishes to model the probability that the outcome of an experiment lies
in one of $k$ mutually exclusive and exhaustive groups. If $p_i$ is used to denote the
probability the outcome lies in the $i$th group then the null hypothesis of interest
is that

\[ p_i = h(i; \theta_0) \tag{1.4} \]

where $h(\cdot)$ is some specified functional form indexed by an unknown parameter
vector $\theta_0$. The question was how to test this hypothesis. In 1928, the challenging
feature of this problem was that the null hypothesis only specified the form of
the probability function up to some unknown parameter vector. At that stage,
the problem had only been solved if the null specified a particular value of $\theta_0$
as well. In the latter case, Karl Pearson (1900) had shown that inference could
be based on the goodness of fit statistic,

\[ GF_T(\theta_0) = \sum_{i=1}^{k} \frac{[T_i - T h(i; \theta_0)]^2}{T h(i; \theta_0)} \tag{1.5} \]

where $T_i$ is the frequency of outcomes in the $i$th group in a sample of size $T$.
Pearson (1900) showed that this statistic was approximately distributed $\chi^2_{k-1}$
under the null hypothesis.⁸ Neyman and Pearson (1928) recognized that if $\theta_0$ is
unknown then the goodness of fit statistic can provide the basis for estimation
of $\theta_0$ as well as inference about the null hypothesis. Their idea was to estimate
$\theta_0$ by $\hat{\theta}_T$, the value of $\theta$ which minimizes the goodness of fit statistic.⁹ In view
of Pearson’s (1900) aforementioned distributional result, Neyman and Pearson
(1928) referred to $\hat{\theta}_T$ as a “Minimum Chi-Square estimator”. Furthermore, they
showed that under the null hypothesis in (1.4), $GF_T(\hat{\theta}_T)$ is approximately
distributed $\chi^2$ with $k - 1 - p$ degrees of freedom where $p$ denotes the dimension of
$\theta$.


7 Jerzy Neyman (1894-1981) was born in Russia but came from a Polish family. Egon
S. Pearson (1895-1980) was the son of Karl Pearson. Their collaboration began in the mid-1920s
when Neyman held a postdoctoral fellowship to study under Karl Pearson at University
College of London where Egon Pearson was also on the faculty. Apart from their seminal work
together, both made numerous other contributions to statistics including Neyman’s work on
the theory of survey sampling, estimation by confidence sets and best asymptotically normal
estimators, and Pearson’s work on quality control and operations research.

8 Notice that the degree of freedom of the distribution is only $k - 1$ and not $k$ because
once the frequencies in $k - 1$ groups are known then the frequency in the $k$th group is
automatically determined by $T_k = T - \sum_{i=1}^{k-1} T_i$.

9 This insight was not completely new even in 1928. Smith (1916) discussed the idea of
choosing estimators to minimize the goodness of fit statistic. However, her focus was on trying
to uncover a sense in which Method of Moments estimators could be considered optimal. In
fact, she found that Method of Moments estimators gave a good approximation to the values
which minimized the goodness of fit statistic in the examples considered in her paper. This
finding may explain why this alternative method of estimation was not explored more fully
until twelve years later. See Bera and Bilias (2002) for further discussion of the origins of
Minimum Chi-Square.


At first glance, it may not be readily apparent that there is any connection
between the estimation problem considered by Neyman and Pearson (1928) and
the problem of how to estimate $(\mu_0, \sigma_0^2)$ based on the first three moments of the
normal distribution. However, both problems actually have the same underlying
structure. To uncover this connection, it is necessary to view Neyman and
Pearson’s (1928) method from a slightly different perspective. To develop this
new interpretation, it is necessary to rewrite the goodness of fit statistic and
introduce a set of indicator variables. First, note the goodness of fit statistic
can be written as

\[ GF_T(\theta_0) = T \sum_{i=1}^{k} \frac{[\hat{p}_i - h(i; \theta_0)]^2}{h(i; \theta_0)} \tag{1.6} \]

where $\hat{p}_i = T_i/T$, the relative frequency in the sample of outcomes in the $i$th
group. Now consider the set of indicator variables $\{D_t(i); i = 1, 2, \ldots, k; t =
1, 2, \ldots, T\}$ which take the value one if the $t$th outcome of the experiment lies in
the $i$th group and take the value zero otherwise. Notice that if (1.4) is true then
it follows that $P(D_t(i) = 1) = h(i; \theta_0)$, and hence that $E[D_t(i)] = h(i; \theta_0)$. So,
using these indicator variables, it can be seen that (1.4) implies the following
vector of $k$ population moment conditions

\[
E \begin{bmatrix}
D_t(1) - h(1; \theta_0) \\
D_t(2) - h(2; \theta_0) \\
\vdots \\
D_t(k) - h(k; \theta_0)
\end{bmatrix} = 0
\tag{1.7}
\]

Since $\sum_{i=1}^{k} \{D_t(i) - h(i; \theta_0)\} = 0$ by definition, only $k - 1$ of the population
moment conditions actually provide unique information about $\theta_0$. However, we
retain all $k$ to elicit the connection with the goodness of fit statistic. If $k - 1 \geq p$
(which we have assumed implicitly all along) then these population moment
equations can be used to estimate $\theta_0$. The sample analogs to (1.7) are given by

\[
\begin{bmatrix}
\hat{p}_1 - h(1; \theta) \\
\hat{p}_2 - h(2; \theta) \\
\vdots \\
\hat{p}_k - h(k; \theta)
\end{bmatrix}
\tag{1.8}
\]
The elements on the left hand side of (1.8) can be recognized as the same
terms which appear inside the square in the numerator of the version of the
goodness of fit statistic in (1.6). We are now in a position to establish the
connection between Minimum Chi-Square estimation of $\theta_0$ and estimation based
on the population moment conditions in (1.7). First consider the case in which
there are as many unique moment conditions as unknown parameters, that is
$k - 1 = p$. By definition, the Method of Moments estimator, $\hat{\theta}_T$ say, satisfies
$\hat{p}_i - h(i; \hat{\theta}_T) = 0$ for $i = 1, 2, \ldots, p$.¹⁰ This property implies that $GF_T(\hat{\theta}_T) = 0$,
and since $GF_T(\theta) \geq 0$, it must follow that $\hat{\theta}_T$ also minimizes $GF_T(\theta)$. So
if $k - 1 = p$ then the Minimum Chi-Square estimator is just the Method of
Moments estimator based on (1.7). Now consider the case in which there are
more unique moment conditions than parameters, that is $k - 1 > p$. In this case,
the principle of Method of Moments estimation does not work, but Minimum
Chi-Square is still valid. The key difference is that Method of Moments is defined
as the solution to a set of moment conditions and this solution only exists if
$k - 1 = p$, whereas Minimum Chi-Square is defined in terms of a minimization,
which can be performed for any $k - 1 \geq p$. This suggests that to estimate
$(\mu_0, \sigma_0^2)$ from the first three moments of the normal distribution, it is necessary
to formulate the estimation in terms of a minimization. To implement such
a strategy, it is necessary to specify an appropriate minimand. Once again,
Minimum Chi-Square provides the answer. It is easily verified that

\[
GF_T(\theta) = T
\begin{bmatrix}
\hat{p}_1 - h(1; \theta) \\
\hat{p}_2 - h(2; \theta) \\
\vdots \\
\hat{p}_k - h(k; \theta)
\end{bmatrix}'
\begin{bmatrix}
p_1^{-1} & 0 & \cdots & 0 \\
0 & p_2^{-1} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & p_k^{-1}
\end{bmatrix}
\begin{bmatrix}
\hat{p}_1 - h(1; \theta) \\
\hat{p}_2 - h(2; \theta) \\
\vdots \\
\hat{p}_k - h(k; \theta)
\end{bmatrix}
\tag{1.9}
\]

where $p_i = h(i; \theta)$, and so $GF_T(\theta)$ can be interpreted as a quadratic form in the
sample moment condition (1.8). Notice that the matrix in the centre of (1.9) is
positive definite¹¹ by construction and so ensures that $GF_T(\theta) \geq 0$. This structure leads to
the following intuitively appealing interpretation of the Minimum Chi-Square
estimator: it is the value of $\theta$ which is closest to solving the sample moment
conditions in the metric of $GF_T(\theta)$.

It takes only a little reflection to realize that the same approach can be
applied to the estimation of any problem in which there are more moments
than parameters to be estimated. To illustrate how, let us return to estimation
of $(\mu_0, \sigma_0^2)$ based on (1.1)-(1.3). For this problem, the minimand
takes the form

$$
MC_T(\mu, \sigma^2) =
\begin{bmatrix} m_T(1) - \mu \\ m_T(2) - (\sigma^2 + \mu^2) \\ m_T(3) - 3m_T(2)\mu + 3m_T(1)\mu^2 - \mu^3 \end{bmatrix}'
M_T
\begin{bmatrix} m_T(1) - \mu \\ m_T(2) - (\sigma^2 + \mu^2) \\ m_T(3) - 3m_T(2)\mu + 3m_T(1)\mu^2 - \mu^3 \end{bmatrix}
\tag{1.10}
$$


10 Note that we can obtain $\hat{\theta}_T$ by solving any $k - 1$ of the sample moment conditions
in (1.8), and that the estimator must satisfy the remaining sample moment condition
because $\sum_{i=1}^{k} \{\hat{p}_i - h(i; \hat{\theta}_T)\} = 0$ by construction.

11 The goodness of fit statistic is undefined unless $\hat{p}_i > 0$ for all $i$.


where $M_T$ is a positive definite matrix which may depend on $T$, and
$m_T(i) = T^{-1} \sum_{t=1}^{T} v_t^i$. Notice that this minimand embodies two
modifications of (1.9) beyond the choice of sample moments. First, the scaling
factor, $T$, has been omitted, because it has no impact on the minimization.
Secondly, we have not specified an exact form for the matrix in the quadratic
form; it can be any positive definite matrix. The Minimum Chi-Square estimators
of $(\mu_0, \sigma_0^2)$ are the values of $(\mu, \sigma^2)$ which minimize
$MC_T(\mu, \sigma^2)$.
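
As a rough illustration of the minimization in (1.10), the following Python sketch estimates $(\mu_0, \sigma_0^2)$ from simulated normal data. Several choices here are assumptions made only for the example and are not prescribed by the text: the weighting matrix $M_T$ is set to the identity, the minimization is a hand-written Gauss-Newton iteration started from the two-moment Method of Moments values, and the sample size, seed and function names are arbitrary.

```python
import random


def sample_moments(v):
    """Return the raw sample moments m_T(1), m_T(2), m_T(3)."""
    T = len(v)
    return [sum(x ** i for x in v) / T for i in (1, 2, 3)]


def g(mu, s2, m):
    """Stacked sample moment conditions of (1.10); s2 stands for sigma^2."""
    m1, m2, m3 = m
    return [m1 - mu,
            m2 - (s2 + mu ** 2),
            m3 - 3 * m2 * mu + 3 * m1 * mu ** 2 - mu ** 3]


def minimand(mu, s2, m):
    """MC_T(mu, sigma^2) with M_T equal to the identity (an assumption)."""
    return sum(gi ** 2 for gi in g(mu, s2, m))


def minimize_mc(m, steps=50):
    """Gauss-Newton iterations for the identity-weighted minimand."""
    m1, m2, _ = m
    mu, s2 = m1, m2 - m1 ** 2  # start at the two-moment Method of Moments values
    for _ in range(steps):
        gv = g(mu, s2, m)
        # Jacobian of g with respect to (mu, s2); one row per moment condition.
        G = [[-1.0, 0.0],
             [-2.0 * mu, -1.0],
             [-3.0 * m2 + 6.0 * m1 * mu - 3.0 * mu ** 2, 0.0]]
        a = sum(r[0] * r[0] for r in G)
        b = sum(r[0] * r[1] for r in G)
        d = sum(r[1] * r[1] for r in G)
        c0 = sum(r[0] * gi for r, gi in zip(G, gv))
        c1 = sum(r[1] * gi for r, gi in zip(G, gv))
        det = a * d - b * b
        # Solve (G'G) step = G'g for the 2x2 system and move against it.
        mu -= (d * c0 - b * c1) / det
        s2 -= (a * c1 - b * c0) / det
    return mu, s2


random.seed(0)
data = [random.gauss(1.0, 2.0) for _ in range(5000)]  # true (mu, sigma^2) = (1, 4)
m = sample_moments(data)
mu_hat, s2_hat = minimize_mc(m)
start_value = minimand(m[0], m[1] - m[0] ** 2, m)
final_value = minimand(mu_hat, s2_hat, m)
print(mu_hat, s2_hat, start_value, final_value)
```

With three moments and two parameters the minimand is generally strictly positive at its minimum, whereas it would be exactly zero if only the first two moments were used.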


This connection between Minimum Chi-Square and moment based estimation
seems to have been made first during the late 1940s and the 1950s. It was
certainly at this time that researchers began to realize the potential generality
of the method, although their perspective was limited inevitably by the
computational constraints of that time. Ferguson (1958) developed the statistical
theory for the estimator in the case where the population moment condition
takes the form $E[g(v_t)] - h(\theta_0) = 0$ and $v_t$ is an i.i.d. process.¹²
However, for some reason, his contribution appears not to have impacted on
econometrics, perhaps because the functional form of the moment condition was
not particularly appropriate for econometric applications of that time. However,
with hindsight, it can be recognized that the statistical framework developed by
Ferguson (1958) contains many of the elements which reappeared in the GMM
literature twenty-five years later, albeit in a far more general context.


The third important antecedent of GMM is the method of Instrumental Variables
(IV) estimation. Unlike Method of Moments and Minimum Chi-Square,
IV was specifically developed to exploit the information in moment conditions
for the estimation of structural economic models. This method appears to have
been first applied in an analysis of demand and supply of agricultural
commodities in the 1920s. In both a U.S. Department of Agriculture Bulletin
(Wright, 1925), and also in the appendix to his father’s book, The Tariff on
Animal and Vegetable Oils (Wright, 1928), Sewall Wright¹³ showed how Method of
Moments could be used to estimate the parameters of supply and demand equations.
He presented these estimators using a technique known as “Path Analysis”, but
it is most convenient to adopt an alternative approach which has become the
standard derivation in econometric textbooks. To illustrate we consider the
system of equations


12 Ferguson (1958) also considers a number of variations on this estimation problem, some 
of which had been analyzed earlier by Barankin and Gurland (1951). Also see Neyman (1949). 

13 Sewall Wright (1889-1988) was an American who is best known for his work on
population genetics. Following his position at the USDA, he became Professor of Zoology at the
University of Chicago and is considered to be one of the three founders of modern theoretical
population genetics.




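$$
\begin{aligned}
q_t^D &= \alpha_0 p_t + u_t^D \\
q_t^S &= \beta_{0,1}' n_t + \beta_{0,2} p_t + u_t^S \\
q_t^D &= q_t^S = q_t
\end{aligned}
\tag{1.11}
$$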


where $q_t^D$, $q_t^S$ represent demand and supply in year $t$, $p_t$ is the
price of the commodity in that year and $n_t$ is a vector containing factors
that affect supply. The market is assumed to clear and the total quantity
produced is denoted $q_t$. For our purposes here, it suffices to consider the
problem of how to estimate $\alpha_0$ given a sample of $T$ observations on
$q_t$ and $p_t$. An Ordinary Least Squares (OLS) regression of $q_t$ on $p_t$
runs into problems here because price and output are simultaneously determined
and this causes OLS estimates to be biased, e.g. see Judge, Griffiths, Hill,
Lütkepohl, and Lee (1985, p.570). Sewall Wright solved these problems as
follows. Suppose there is an observable variable $z_t^D$ which is related to
price but whose covariance with $u_t^D$, $Cov[z_t^D, u_t^D]$, is zero.
An example would be any of the factors that affect supply, such as an input
price or yield per acre. Then by taking the covariance of $z_t^D$ with both
sides of the demand equation in (1.11) it follows that


$$Cov[z_t^D, q_t] - \alpha_0 Cov[z_t^D, p_t] = 0 \tag{1.12}$$


It is convenient to simplify this moment condition using other properties of the
model. Typically, it is assumed that $E[u_t^D] = 0$ and so
$E[q_t] = \alpha_0 E[p_t]$. Using this identity in (1.12), the moment condition
can be rewritten as¹⁴


$$E[z_t^D q_t] - \alpha_0 E[z_t^D p_t] = 0 \tag{1.13}$$


Equation (1.13) provides a population moment condition involving the observable
variables and the unknown parameter, $\alpha_0$, which can be used as a basis
for estimation. Pearson’s Method of Moments principle leads to the estimation
of the parameters by the values which solve the analogous sample moments,
namely


$$\hat{\alpha}_T = \frac{\sum_{t=1}^{T} z_t^D q_t}{\sum_{t=1}^{T} z_t^D p_t} \tag{1.14}$$
This equation can be recognized as what is known today as an Instrumental
Variables estimator with $z_t^D$ being referred to as the “instrument”. However
this term was not coined until the 1940s when IV was rediscovered and came to
stay in econometrics. In fact, Wright’s work was largely ignored by economists
until Goldberger (1970) returned it to its rightful place in the history of
econometrics.
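
The contrast between OLS and the estimator in (1.14) can be sketched with a small simulation. Everything numerical below is hypothetical: the demand slope, the supply equation $q_t = n_t + p_t + u_t^S$ with a scalar supply shifter $n_t$, the standard normal shocks and the sample size are chosen only to make the simultaneity problem visible, and the supply shifter plays the role of the instrument $z_t^D$.

```python
import random

random.seed(0)
alpha0, T = -1.0, 5000  # hypothetical demand slope and sample size
z, p, q = [], [], []
for _ in range(T):
    n_t = random.gauss(0.0, 1.0)  # supply shifter, used as the instrument
    u_d = random.gauss(0.0, 1.0)  # demand shock
    u_s = random.gauss(0.0, 1.0)  # supply shock
    # Market clearing: alpha0 * p + u_d = n + p + u_s, solved for the price.
    p_t = (n_t + u_s - u_d) / (alpha0 - 1.0)
    q_t = alpha0 * p_t + u_d
    z.append(n_t)
    p.append(p_t)
    q.append(q_t)

# OLS of q on p (no intercept): biased because p depends on the demand shock.
alpha_ols = sum(pi * qi for pi, qi in zip(p, q)) / sum(pi * pi for pi in p)
# Sample analogue of (1.13), i.e. the estimator in (1.14).
alpha_iv = sum(zi * qi for zi, qi in zip(z, q)) / sum(zi * pi for zi, pi in zip(z, p))
print(alpha_ols, alpha_iv)
```

In this design the price is positively correlated with the demand shock, so the OLS estimate is pulled away from the demand slope while the IV estimate stays close to it.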
A similar Method of Moments reasoning was used in the 1940s. However, 
this time, IV was proposed as a solution for the problems caused by errors in 
variables. To illustrate, consider the case in which 


$$y_t = \gamma_0 x_t^* + u_{1,t} \tag{1.15}$$

but $x_t^*$ is only observed with error,

$$x_t = x_t^* + u_{2,t}$$


14 Recall that for any two random variables $a$ and $b$, $Cov[a, b] = E[ab] - E[a]E[b]$.


Since the regressor is unobserved, equation (1.15) cannot be estimated directly. 
Instead inference is based on 


$$y_t = \gamma_0 x_t + u_t \tag{1.16}$$


Ordinary Least Squares estimation of (1.16) is biased because $x_t$ and
$u_t = u_{1,t} - \gamma_0 u_{2,t}$ are correlated; e.g. see Judge, Griffiths,
Hill, Lütkepohl, and Lee (1985, p.705-8). Reiersøl (1941) and Geary (1942, 1943)
independently proposed solving this problem by introducing a variable $z_t$
which is correlated with $x_t$ but uncorrelated with $u_t$.¹⁵ Using the same
intuition as Wright, Reiersøl and Geary deduced the moment condition


$$Cov[z_t, y_t] - \gamma_0 Cov[z_t, x_t] = 0$$


The Method of Moments estimation principle leads to the analogous formula to
(1.14) for the IV estimator of $\gamma_0$.
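
A minimal simulation of the errors in variables case is sketched below. The design is an assumption made for illustration only: the instrument is taken to be a second, independent noisy measurement of the unobserved regressor, all shocks are standard normal, and the slope and sample size are arbitrary.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists (divisor T)."""
    T = len(a)
    mean_a, mean_b = sum(a) / T, sum(b) / T
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / T


random.seed(0)
gamma0, T = 2.0, 20000  # hypothetical true slope and sample size
x_obs, y, z = [], [], []
for _ in range(T):
    x_star = random.gauss(0.0, 1.0)                      # true regressor, never observed
    y.append(gamma0 * x_star + random.gauss(0.0, 1.0))   # equation (1.15)
    x_obs.append(x_star + random.gauss(0.0, 1.0))        # regressor measured with error
    z.append(x_star + random.gauss(0.0, 1.0))            # instrument: second noisy measurement

gamma_ols = cov(x_obs, y) / cov(x_obs, x_obs)  # attenuated towards zero
gamma_iv = cov(z, y) / cov(z, x_obs)           # solves the sample covariance condition
print(gamma_ols, gamma_iv)
```

Here the measurement error halves the OLS slope in large samples, while the ratio of sample covariances recovers a value near the true slope.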

Reiersøl (1945) introduced the term “instrumental variables” and Geary
(1949) derived certain statistical properties of the estimator in the context of the
errors in variables model. Durbin (1954) extended the method to simultaneous
equation models, and Sargan (1958, 1959) provided the first complete
theoretical analyses of the estimator.¹⁶ Building from this basis, the IV framework has
become so developed that, prior to the introduction of GMM, it was typically
treated in econometrics as an estimation technique in its own right rather than
being perceived as an example of the Method of Moments.¹⁷ Within this
literature on IV, Amemiya (1974) and Jorgenson and Laffont (1974) played an
important role in extending the method to nonlinear models, and the statistical
theory employed in these papers is an important precursor to the arguments
used to analyze the properties of GMM.

The above discussion has illustrated some of the problems to which mo- 
ment based estimation has been applied. Over the years, considerable attention 
has been focused on analyzing the properties of these estimators and various 
associated inference techniques. However, this theory has tended to place re- 
strictions on the functional form of the population moment condition. One of 


15 See Morgan (1990, p.220-8) and Aldrich (1993) for more detailed discussions of the
emergence of IV in the 1940s. Olav Reiersøl (1908– ) is a Norwegian statistician who made
a number of important contributions to econometrics, most notably through his work on IV
and identification. He also contributed to other areas of statistics as well as genetics. Robert
(Roy) Geary (1896-1983) was an Irishman who worked as a government statistician in Dublin
for most of his career. Apart from his work in mathematical statistics, he is also known for
being one of the pioneers in the field of national income accounting.

16 See Arellano (2002) for an appraisal of the connection between Sargan’s work and GMM. 

17 There are some exceptions. For instance, Burguete, Gallant, and Souza (1982) use the
term “Method of Moments” to denote a class of estimators of the parameters of nonlinear
static simultaneous equation models which includes IV estimators.
the main contributions of GMM is to provide a framework for the statistical
analysis based on essentially any population moment condition. Accordingly, it 
is necessary to adopt a broad definition of a population moment condition. 


Definition 1.1 Population Moment Condition

Let $\theta_0$ be a vector of unknown parameters which are to be estimated,
$v_t$ be a vector of random variables and $f(\cdot)$ a vector of functions; then
a population moment condition takes the form


$$E[f(v_t, \theta_0)] = 0 \tag{1.17}$$

for all $t$.


This definition encompasses the examples discussed above. For instance, the
moment condition in (1.1) can be obtained from (1.17) by putting


$$
f(v_t, \theta_0) = \begin{bmatrix} v_t - \mu_0 \\ v_t^2 - (\sigma_0^2 + \mu_0^2) \end{bmatrix}
$$


where $\theta_0 = (\mu_0, \sigma_0^2)'$. Wright’s example in (1.13) is obtained by putting

$$f(v_t, \theta_0) = z_t^D q_t - \alpha_0 z_t^D p_t$$


where $v_t = (z_t^D, q_t, p_t)'$ and $\theta_0 = \alpha_0$.

Just as in Minimum Chi-Square, GMM involves choosing parameter estimators
to minimize a quadratic form in a weighting matrix, $W_T$, and the sample
moment $T^{-1} \sum_{t=1}^{T} f(v_t, \theta)$.


Definition 1.2 Generalized Method of Moments Estimator
The Generalized Method of Moments estimator based on (1.17) is the value of
$\theta$ which minimizes:


$$
Q_T(\theta) = \left\{ T^{-1} \sum_{t=1}^{T} f(v_t, \theta) \right\}' W_T \left\{ T^{-1} \sum_{t=1}^{T} f(v_t, \theta) \right\}
\tag{1.18}
$$


where $W_T$ is a positive semi-definite matrix which may depend on the data but
converges in probability to a positive definite matrix of constants.


The restrictions on the weighting matrix are required to ensure that
$Q_T(\theta)$ is a meaningful measure of distance. Notice that the positive
semi-definiteness of $W_T$ ensures both that $Q_T(\theta) \geq 0$ for any
$\theta$, and also that $Q_T(\hat{\theta}_T) = 0$ if
$T^{-1} \sum_{t=1}^{T} f(v_t, \hat{\theta}_T) = 0$. However, positive
semi-definiteness leaves open the possibility that $Q_T(\hat{\theta}_T)$ is zero
at a value of $\hat{\theta}_T$ which does not satisfy the sample moment
conditions. Since all our analysis is based on asymptotic theory, it is
only necessary to rule out this eventuality in the limit as $T \to \infty$.
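
The role of positive definiteness can be seen in a two-moment toy calculation. The numbers below are purely hypothetical: the vector stands for a sample moment evaluated at some parameter value at which the second moment condition is not satisfied, and the helper name is made up for this sketch.

```python
def quadratic_form(g, W):
    """Return g' W g for a list g and a square matrix W given as nested lists."""
    n = len(g)
    return sum(g[i] * W[i][j] * g[j] for i in range(n) for j in range(n))


g_bar = [0.0, 0.5]                      # hypothetical sample moments; the second is non-zero
W_singular = [[1.0, 0.0], [0.0, 0.0]]   # positive semi-definite but singular
W_identity = [[1.0, 0.0], [0.0, 1.0]]   # positive definite

q_singular = quadratic_form(g_bar, W_singular)  # 0.0 although g_bar is not zero
q_identity = quadratic_form(g_bar, W_identity)  # 0.25, so the violation is detected
print(q_singular, q_identity)
```

The singular weighting matrix ignores the second moment entirely, which is exactly the eventuality that the limiting positive definiteness requirement rules out.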

A comparison of (1.10) and (1.18) indicates that Minimum Chi-Square and
GMM are essentially the same method. With hindsight, it might be argued
that a new terminology was not really needed. However, Hansen (1982) referred
to the estimator in Definition 1.2 as “Generalized Method of Moments”, and
that is the name by which the method is known in econometrics.¹⁸ We shall,
therefore, follow this practice.

The next section presents five examples of moment conditions from models 
in Table 1.1. These models have been carefully selected because they provide 
convenient illustrations of many of the issues discussed in this book. Here, 
the focus is on showing how the population moment conditions arise and the 
potential problems encountered with maximum likelihood estimation in these 
models. 


1.3 Five Examples of Moment Conditions in 
Economic Models 


1.3.1 Consumption-Based Asset Pricing Model 


The consumption-based asset pricing model is used by financial economists to 
explain how assets are priced and by macroeconomists to explain the evolution 
of consumption spending. To see how this can be done, it is necessary first 
to present the model formally and derive the population moment conditions 
which are the basis for GMM estimation. The ultimate aim of the model is 
to explain aggregate movements. This is done using a framework in which 
aggregate outcomes are assumed to be the result of the decisions made by a single 
“representative” agent. This representative agent approach is certainly open to 
criticism, e.g. see Kirman (1992), but nevertheless has received considerable 
attention in the literature. The general theoretical structure was first developed 
by Lucas (1978). However, Hansen and Singleton (1982) were first to highlight 
and exploit the potential of GMM in these types of models. 

Consider the case where a representative agent makes decisions about
consumption expenditures and investment to maximize his/her expected discounted
utility


$$E\left[ \sum_{i=0}^{\infty} \delta_0^i U(c_{t+i}) \,\Big|\, \Omega_t \right]$$


where $c_t$ is consumption in period $t$, $U(\cdot)$ is a strictly concave
utility function, $\delta_0$ is a constant discount factor and $\Omega_t$ is
the information set available to the agent at time $t$. In any period the agent
can choose to spend his/her income on either goods for consumption or
investments in a collection of $N$ assets with maturities $m_j$,
$j = 1, 2, \ldots, N$. Let $q_{j,t}$ be the quantity of asset $j$ held at the end


18 In fact, this terminology originates from a set of unpublished lecture notes produced 
by Christopher Sims for his graduate econometrics course at the University of Minnesota. 
Interestingly, Sims used the term to denote an estimator which is obtained by solving a linear 
combination of moment conditions rather than via the minimization in Definition 1.2. Hansen 
developed certain statistical results for Sims’s estimator as part of his Ph.D. thesis submitted
to the University of Minnesota in 1978. Hansen and Sims provide interesting background on 
the genesis of the method in interviews published in the October 2002 issue of the Journal of 
Business and Economic Statistics. 
of period $t$, $p_{j,t}$ be the price of asset $j$ at time $t$, $r_{j,t}$ be
the period $t$ payoff from a unit of the $j$th asset purchased in period
$t - m_j$, and $w_t$ be real labour income in period $t$. All prices are
denominated in terms of the consumption good.¹⁹ The budget constraint is


$$c_t + \sum_{j=1}^{N} p_{j,t} q_{j,t} = \sum_{j=1}^{N} r_{j,t} q_{j,t-m_j} + w_t$$


for all $t$. The optimal path of consumption and investment satisfies

$$p_{j,t} U'(c_t) = \delta_0^{m_j} E[r_{j,t+m_j} U'(c_{t+m_j}) \,|\, \Omega_t] \tag{1.19}$$


for all $t$ and $j = 1, 2, \ldots, N$, where $U'(c)$ denotes the marginal
utility of consumption. This condition states that the utility lost by foregoing
consumption in period $t$ to purchase a unit of asset $j$, $p_{j,t} U'(c_t)$,
must equal the value in period $t$ of the expected utility gained from consuming
the return on the investment in period $t + m_j$,
$\delta_0^{m_j} E[r_{j,t+m_j} U'(c_{t+m_j}) \,|\, \Omega_t]$. Equation (1.19)
can be rewritten as


$$E[\delta_0^{m_j} (r_{j,t+m_j}/p_{j,t}) \{U'(c_{t+m_j})/U'(c_t)\} \,|\, \Omega_t] - 1 = 0 \tag{1.20}$$


for $j = 1, 2, \ldots, N$. Equation (1.20) is referred to as the Euler equation
of the system, after the mathematician Leonhard Euler (1707-83) who derived an
analogous equation to characterize the solution path in the calculus of
variations problem. The Euler equation places a restriction on the co-movements
of consumption and asset prices and so can be used by macroeconomists and
financial economists to learn about these variables.

So far, the analysis has been in terms of a general utility function, but to 
make (1.20) operational it is necessary to specify a particular functional form. 
At this stage it is most convenient to follow Hansen and Singleton (1982) and 
define 


$$U(c) = \frac{c^{\gamma_0} - 1}{\gamma_0} \tag{1.21}$$

The parameter $\gamma_0$ must be less than one for the utility function to be
concave. This functional form is known as the constant relative risk aversion
(CRRA) utility function because the relative risk aversion of the representative
agent is $(1 - \gamma_0)$ at any level of consumption. Differentiating (1.21)
and making the appropriate substitutions into (1.20), the Euler equation becomes


$$E[\delta_0^{m_j} (r_{j,t+m_j}/p_{j,t}) (c_{t+m_j}/c_t)^{\gamma_0 - 1} \,|\, \Omega_t] - 1 = 0 \tag{1.22}$$


Clearly with this specification there are two parameters to be estimated, namely
$(\gamma_0, \delta_0)$. Taking unconditional expectations of the Euler equation
provides one population moment condition involving these parameters, but, in
fact, (1.22) implies many more moment conditions. If we set


$$u_{j,t}(\gamma_0, \delta_0) = \delta_0^{m_j} (r_{j,t+m_j}/p_{j,t}) (c_{t+m_j}/c_t)^{\gamma_0 - 1} - 1$$


19 In other words, $p_{j,t}$ is the price of the asset in dollars divided by the price of the
consumption good in dollars.
then an iterated conditional expectations argument can be used in conjunction
with the Euler condition in (1.22) to show that


$$E[u_{j,t}(\gamma_0, \delta_0) z_t] = E[\, E[u_{j,t}(\gamma_0, \delta_0) \,|\, \Omega_t]\, z_t \,] = 0 \tag{1.23}$$


for any vector $z_t \in \Omega_t$. In this context, $z_t$ might include a
constant, which amounts to taking the unconditional expectation of the Euler
equation, and variables such as $r_{j,t}/p_{j,t-m_j}$, $c_t/c_{t-m_j}$, or
indeed any other macroeconomic variables contained in the representative agent’s
information set. The moment conditions in (1.23) provide the basis for GMM
estimation of the parameters $(\gamma_0, \delta_0)$.
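
To make the construction of the moment conditions in (1.23) concrete, the sketch below assembles their sample analogue for a single asset with $m_j = 1$ and instruments consisting of a constant and lagged consumption growth. The data generating process is entirely an assumption of this example: consumption growth and a unit-mean return shock are drawn as independent lognormals, and payoffs are built so that the Euler equation (1.22) holds at the chosen hypothetical parameter values. No estimation is attempted; the point is only that the sample moments are near zero at the true parameters and not at a wrong value.

```python
import math
import random


def euler_sample_moments(gamma, delta, gross_return, growth, instruments):
    """Average of u_t(gamma, delta) * z_t, the sample analogue of (1.23), m_j = 1.

    gross_return[t] is r_{t+1}/p_t, growth[t] is c_{t+1}/c_t and
    instruments[t] is a list of variables known at time t.
    """
    T, K = len(instruments), len(instruments[0])
    sums = [0.0] * K
    for R, g, z in zip(gross_return, growth, instruments):
        u = delta * R * g ** (gamma - 1.0) - 1.0
        for k in range(K):
            sums[k] += u * z[k]
    return [s / T for s in sums]


random.seed(1)
gamma0, delta0, T = 0.5, 0.97, 20000  # hypothetical true values and sample size
growth_all = [math.exp(random.gauss(0.02, 0.02)) for _ in range(T + 1)]
# Lognormal shock with mean one: exp(-0.005 + 0.1**2 / 2) = 1.
shock = [math.exp(random.gauss(-0.005, 0.1)) for _ in range(T)]
growth = growth_all[1:]
# Payoffs constructed so that (1.22) holds at (gamma0, delta0).
gross_return = [e * g ** (1.0 - gamma0) / delta0 for e, g in zip(shock, growth)]
instruments = [[1.0, g_lag] for g_lag in growth_all[:-1]]

at_truth = euler_sample_moments(gamma0, delta0, gross_return, growth, instruments)
at_wrong = euler_sample_moments(gamma0, 0.90, gross_return, growth, instruments)
print(at_truth, at_wrong)
```

Because growth is serially independent in this design, the lagged-growth instrument adds little identifying information; a design with persistent growth would be needed before using such moments to estimate both parameters.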

In contrast, Maximum Likelihood estimation would involve specifying the
conditional distribution for
$\{(r_{j,t+m_j}/p_{j,t}, c_{t+m_j}/c_t);\ j = 1, 2, \ldots, N\}$ and maximizing
the likelihood subject to the constraint in (1.22) for each $t$. The latter
would involve numerical integration in most cases and is consequently
computationally very burdensome.²⁰ Furthermore, due to the inherent nonlinearity
of the model, Hansen and Singleton (1982) show that MLE is unlikely to yield
unbiased inferences unless the distribution is correctly specified.²¹ The
potential for this bias can be reduced by using a flexible functional form which
is capable of approximating a wide class of probability density functions; e.g.
see Gallant and Tauchen (1989). However this further adds to the computational
burden.


1.3.2 Evaluation of Mutual Fund Performance 


Mutual funds consist of a portfolio of financial assets administered by a fund
manager.²² The role of the manager is to vary the composition of this portfolio
in response to any relevant economic or financial information to meet some spec- 
ified criterion. An investor can purchase shares in the fund and thereby acquire 
an asset whose rate of return is that of the portfolio. The incentive for invest- 
ing in the fund stems from the ability of the manager to acquire and efficiently 
process market information. However, in practice managers may misread their 
information or simply be the victims of unpredictable events. In this case the 
average investor may have received a better return by constructing his/her own 
portfolio based on a more restricted information set. Naturally there is consider- 
able interest in identifying which funds have yielded superior returns compared 
to some suitably chosen benchmark. This topic received some attention in the 
1970s, but interest has increased recently in response to the massive growth in 
assets managed by such funds in the U.S. In this section we describe a measure 
of fund performance proposed by Chen and Knez (1996). These authors actu- 
ally propose a number of related measures but at this time it is sufficient to 
focus on the simplest because it illustrates how the moment condition arises. 


20 One exception is the model studied by Hansen and Singleton (1983). They estimate
the CRRA model described above by Maximum Likelihood under the assumption that
$(\{r_{j,t+m_j}/p_{j,t}\}, c_{t+m_j}/c_t)$ have a lognormal distribution.

21 See Section 3.8.

22 In practice funds may be administered by a team of managers, but for expositional
convenience we refer to a single manager.
To begin with, it is useful to review two very fundamental results from
finance. The “Law of One Price” states that any two investments with the
same payoff in every state of the world must have the same price (e.g. see
Ingersoll, 1987, p.59). The second fundamental result is deduced from this law.
Chamberlain and Rothschild (1983) show that the Law of One Price implies
a useful characterization of the relationship between the price and return of
a financial asset. To flesh out this asset pricing equation, it is necessary to
introduce some notation. Let $X_t$ be an $(N \times 1)$ vector of payoffs on $N$
traded assets with $n$th element $x_{n,t}$, which is the time $t$ return per
time $t - 1$ dollar invested in asset $n$. Notice that each payoff, $x_{n,t}$,
can be interpreted as an asset with a price of \$1. Chamberlain and Rothschild
(1983) show that the Law of One Price implies there exists a unique scalar
random variable $d_t = X_t' \delta_0$ such that


$$E[X_t d_t] = 1_N \tag{1.24}$$


where $1_N$ is an $N \times 1$ vector of ones and $\delta_0$ is an
$N \times 1$ vector of constants. The variable $d_t$ is known as the stochastic
discount factor.²³ As we shall see, this asset pricing equation is central to
Chen and Knez’s method.

To evaluate the performance of a mutual fund it is necessary to have some 
benchmark. Since managers are essentially selling their ability to gather and 
process information, it is natural to compare the fund’s return to that achievable 
by an investor with no such information. This “uninformed” investor is taken 
to be an individual who holds a constant composition portfolio and hence never 
buys or sells assets in response to new information. Let the weights of this
portfolio be collected into an $N \times 1$ vector $a$ whose $n$th element is
$a_n$. The return on such a passively held portfolio in period $t$ is given by


$$R_t(a) = \sum_{n=1}^{N} a_n x_{n,t}, \quad \text{for } \sum_{n=1}^{N} a_n = 1 \tag{1.25}$$


Notice the weights have been normalized to sum to one and so $R_t(a)$ can be
interpreted as the payoff achievable from an initial investment of \$1. Also, the
weight on $x_{n,t}$ can be positive indicating a long position in the asset or
it can be negative indicating a short position.²⁴ In contrast to the uninformed
investor, the fund manager has the option of updating the composition of the
fund’s portfolio and this is reflected by making the weights in his/her
portfolio time dependent. Accordingly, the return on the fund is


$$r_t^m = \sum_{n=1}^{N} \theta_{n,t} x_{n,t}, \quad \text{for } \sum_{n=1}^{N} \theta_{n,t} = 1 \tag{1.26}$$


23 Tt is also known as the “pricing operator” or the “pricing kernel”. 

24 An investor holds a long position in an asset if he/she owns units of the asset. An investor 
holds a short position in an asset if he/she has sold units of an asset that they did not own, 
say by borrowing it from a broker, and must return the borrowed units at some point in the 
future. 
where the superscript $m$ represents “mutual fund”. Again the weights,
$\{\theta_{n,t}\}$ this time, sum to one and so $r_t^m$ represents the return on
a \$1 investment.
Clearly, the manager has the option to leave the weights unchanged over time. 
However, if he/she follows this strategy then the fund does not increase the 
opportunity set for investors. In this case, Chen and Knez argue the manager 
has provided no service and so should receive a performance measure of zero. 
Furthermore, they argue that the manager should receive the same evaluation 
if he/she changes the weights of the fund’s portfolio but this only leads to a 
return which could have been earned by some passively held portfolio over the 
same period. A positive performance measure is only earned if the fund return 
exceeds that on any passively held portfolio over the same period. 

It is clearly desirable to identify which funds have positive performance mea- 
sures. It turns out to be most convenient to address this issue by reversing the 
question and seeking to identify funds with a zero performance measure. Chen 
and Knez (1996) show that the fund has a zero performance measure relative 
to the benchmark set of passively held portfolios in (1.25) if 


$$\lambda(r_t^m, d_t) = E[r_t^m X_t' \delta_0] - 1 = 0 \tag{1.27}$$


To assess whether (1.27) is true, an estimate of $\delta_0$ is needed. Chen and
Knez solve this problem by combining (1.24) and (1.27) into the augmented
population moment condition


$$E[Q_t X_t' \delta_0] - 1_{N+1} = 0 \tag{1.28}$$


where $Q_t = (X_t', r_t^m)'$. These equations provide a basis for the
estimation of $\delta_0$. At first glance this appears to impose the very
hypothesis that we wish to test. However, (1.28) is a vector of $N + 1$ moment
conditions in $N$ parameters and so the sample moments are not zero when
evaluated at the estimated value of $\delta_0$. As we shall see, this leaves
scope for testing whether the data are actually consistent with (1.28) and hence
the hypothesis that the fund has a performance measure of zero.
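
Since (1.28) is linear in the parameters, the minimizer of the GMM quadratic form has a closed form, which the following sketch computes for $N = 2$. The set-up is hypothetical throughout: two simulated gross returns, a fund whose weight on the first asset is drawn at random each period, and an identity weighting matrix chosen only for simplicity. Writing $A = T^{-1} \sum_t Q_t X_t'$, the sample moment is $A\delta - 1_{N+1}$ and the identity-weighted minimizer solves $A'A\delta = A'1_{N+1}$.

```python
import random


def solve_2x2(M, b):
    """Solve the 2x2 linear system M x = b by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * b[0] - M[0][1] * b[1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]


random.seed(2)
T = 5000
A = [[0.0, 0.0] for _ in range(3)]  # accumulates T^{-1} sum_t Q_t X_t' (3 x 2)
for _ in range(T):
    x = [1.01 + 0.01 * random.gauss(0.0, 1.0),   # hypothetical gross return, asset 1
         1.06 + 0.20 * random.gauss(0.0, 1.0)]   # hypothetical gross return, asset 2
    w = random.random()                          # time-varying fund weight on asset 1
    r_fund = w * x[0] + (1.0 - w) * x[1]         # fund return as in (1.26)
    Q = x + [r_fund]                             # Q_t = (X_t', r_t^m)'
    for i in range(3):
        for j in range(2):
            A[i][j] += Q[i] * x[j] / T

AtA = [[sum(A[i][r] * A[i][c] for i in range(3)) for c in range(2)] for r in range(2)]
At1 = [sum(A[i][r] for i in range(3)) for r in range(2)]
delta_hat = solve_2x2(AtA, At1)

# Sample moments at the estimate: three conditions, two parameters.
g_bar = [A[i][0] * delta_hat[0] + A[i][1] * delta_hat[1] - 1.0 for i in range(3)]
# First-order conditions A'g = 0 hold even though g itself need not vanish.
foc = [sum(A[i][r] * g_bar[i] for i in range(3)) for r in range(2)]
print(delta_hat, g_bar, foc)
```

The estimator sets only two linear combinations of the three sample moments to zero, which is what leaves room for a test of the remaining restriction.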

This problem could be approached using Maximum Likelihood estimation. It
would involve specifying the conditional distribution of $Q_t$ given the
information available at time $t - 1$ and assessing whether the estimated
distribution satisfied
the moment restriction in (1.27). However, this approach encounters both the 
types of problem described in Section 1.1. First, it is necessary to make a 
distributional assumption. A natural choice is normality but, unfortunately, 
this is not appropriate for stock return data; see Richardson and Smith (1993). 
To date there is no consensus on the appropriate choice; see Fama (1976, p.26) 
and Bollerslev, Engle, and Nelson (1994) for discussions of common features of 
the distribution of asset return data. Of course, unless the true distribution 
is used there is no guarantee that MLE yields more precise estimators than 
those obtained by GMM. Second, such estimation will involve significantly more 
parameters than the N involved in Chen and Knez’s approach and so will be 
more computationally intensive. 
1.3.3 Conditional Capital Asset Pricing Model


Harvey (1991) investigates whether the conditional Capital Asset Pricing Model 
(conditional CAPM hereafter) can explain the differences in the average returns 
across financial markets in industrialized countries. The original, or uncondi- 
tional, CAPM is one of the main models in finance and has received a lot of 
academic and non-academic attention; e.g. see Malkiel (1987). Its importance 
stems from its provision of an explicit relationship between the expected rate
of return on an asset and the systematic risk of holding that asset. In this
context risk is measured by the variance of the asset return and derives from two
sources. There is systematic risk which derives from the inherent uncertainties
in the macroeconomy and there is unsystematic risk which is specific to the
stock in question.²⁵ Systematic risk is measured as the variance of the so-called
“market portfolio”. This portfolio consists of all the assets in the market and 
so represents the most diversified portfolio it is possible to hold. By holding 
a suitably large portfolio the investor can diversify away the unsystematic risk 
and so he/she is only compensated for bearing the systematic risk in holding 
an asset. Systematic risk is present in all risky assets but to different degrees 
depending on the nature of the asset. Another attractive feature of CAPM is 
that it provides a measure of the degree of the systematic risk present in an 
asset; this measure is known as the investment beta. 

One weakness of the original CAPM is its implicit assumption that the level 
of systematic risk in an asset stays constant over time. Intuition suggests this 
risk should vary in response to changes in the macroeconomy and decisions 
made by the firm issuing the asset. This type of behaviour can be incorporated 
into the theory and yields the conditional CAPM. To introduce the model it is 
necessary to define first some notation. Let $R_{i,t}$ be the return in period
$t$ on investing \$1 in the asset in question in period $t - 1$, $R_{m,t}$ be
the corresponding return on investing \$1 in the market portfolio in period
$t - 1$ and $R_{f,t}$ be the return in period $t$ from investing \$1 in the
risk free asset in period $t - 1$.²⁶ The excess returns on the asset and the
market portfolio are defined respectively as $r_{i,t} = R_{i,t} - R_{f,t}$ and
$r_{m,t} = R_{m,t} - R_{f,t}$. The conditional CAPM implies


$$E[r_{i,t} \,|\, \Omega_{t-1}] = \beta_{i,t} E[r_{m,t} \,|\, \Omega_{t-1}] \tag{1.29}$$

where the conditional investment beta is

$$\beta_{i,t} = Cov[r_{i,t}, r_{m,t} \,|\, \Omega_{t-1}] / Var[r_{m,t} \,|\, \Omega_{t-1}] \tag{1.30}$$


and $E[\cdot \,|\, \Omega_{t-1}]$, $Var[\cdot \,|\, \Omega_{t-1}]$ and
$Cov[\cdot \,|\, \Omega_{t-1}]$ denote respectively the expectation,
variance and covariance conditional on an information set $\Omega_{t-1}$.²⁷

We can now return to the specifics of Harvey’s (1991) study. He examines 
whether the model in (1.29)—(1.30) can explain the variation in the returns 


25 Systematic and unsystematic risk are also referred to as market and idiosyncratic risk
respectively.

26 A risk-free asset is one whose return is known at the time of purchase. 

27 The original CAPM can be obtained from (1.29)—(1.30) by replacing the conditional 
expectations, variance and covariance by their unconditional counterparts. 
across seventeen international stock markets. In this context $r_{i,t}$ becomes
the excess return on holding the market portfolio for country $i$. The variable
$r_{m,t}$ is the excess return from holding a “world market” portfolio that is a
weighted combination of the returns on a variety of world-wide investments; see
Harvey (1991) for details. To make the model operational it is necessary to
specify the conditional means of the excess returns. To this end, let $z_{t-1}$
be the vector of relevant economic and financial variables contained in
$\Omega_{t-1}$. Harvey assumes that


$$
\begin{aligned}
E[r_{i,t} \,|\, \Omega_{t-1}] &= z_{t-1}' \delta_{i,0} \\
E[r_{m,t} \,|\, \Omega_{t-1}] &= z_{t-1}' \delta_{m,0}
\end{aligned}
\tag{1.31}
$$


where $\delta_{m,0}$, $\{\delta_{i,0}\}$ are unknown vectors of constants. The
parameters to be estimated are $\delta_{m,0}$ and
$\{\delta_{i,0};\ i = 1, 2, \ldots, 17\}$. The estimation is based on two
types of moment conditions: those implied by the specification of the conditional
means, (1.31), and those implied by the conditional CAPM, (1.29)–(1.30). To
present the moment conditions it is convenient to define


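$$
\begin{aligned}
u_{i,t} &= r_{i,t} - z_{t-1}' \delta_{i,0} \\
u_{m,t} &= r_{m,t} - z_{t-1}' \delta_{m,0}
\end{aligned}
\tag{1.32}
$$


The first set of moments comes from using iterated conditional expectations and
$E[u_{i,t} \,|\, \Omega_{t-1}] = 0$ to show that


$$E[u_{i,t} z_{t-1}] = E[\, E[u_{i,t} z_{t-1} \,|\, \Omega_{t-1}] \,] = E[\, E[u_{i,t} \,|\, \Omega_{t-1}]\, z_{t-1} \,] = 0 \tag{1.33}$$


Using a similar argument for $u_{m,t}$ and substituting from (1.32) yields the
moment conditions


$$
\begin{aligned}
E[(r_{i,t} - z_{t-1}' \delta_{i,0}) z_{t-1}] &= 0 \\
E[(r_{m,t} - z_{t-1}' \delta_{m,0}) z_{t-1}] &= 0
\end{aligned}
\tag{1.34}
$$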


for i=1,2,...,17. The second set of moment conditions comes directly from the 
conditional CAPM structure. The substitution of (1.30) into (1.29) plus some 
rearrangement yields 


$$Var[r_{m,t} \,|\, \Omega_{t-1}]\, E[r_{i,t} \,|\, \Omega_{t-1}] - Cov[r_{i,t}, r_{m,t} \,|\, \Omega_{t-1}]\, E[r_{m,t} \,|\, \Omega_{t-1}] = 0 \tag{1.35}$$


Employing a similar iterated conditional expectations argument as in (1.33) and 
substituting from (1.31), it can be deduced that 


$$
E[\{(r_{m,t} - z_{t-1}' \delta_{m,0})^2 z_{t-1}' \delta_{i,0} - (r_{m,t} - z_{t-1}' \delta_{m,0})(r_{i,t} - z_{t-1}' \delta_{i,0}) z_{t-1}' \delta_{m,0}\} z_{t-1}] = 0
\tag{1.36}
$$


for i=1,2,...17, which constitute the second set of moment conditions used in 
estimation. 
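
The two sets of moment conditions can be assembled mechanically, as the following sketch for a single asset suggests. It is only an illustration under simplifying assumptions that are not part of Harvey's study: the instrument vector holds a constant and one standard normal variable, the market residual is homoskedastic, and the conditional beta is constant, so that $\delta_{i,0} = \beta \delta_{m,0}$ and the conditional CAPM holds by construction. All numerical values and names are hypothetical.

```python
import random


def capm_moments(delta_i, delta_m, r_i, r_m, Z):
    """Sample analogues of (1.34) and (1.36) for one asset.

    Z[t] holds z_{t-1}, the instruments known before r_i[t] and r_m[t] are
    realised. Returns 3*K averages, K being the number of instruments.
    """
    T, K = len(Z), len(Z[0])
    sums = [0.0] * (3 * K)
    for z, ri, rm in zip(Z, r_i, r_m):
        fit_i = sum(a * b for a, b in zip(z, delta_i))  # z'_{t-1} delta_i
        fit_m = sum(a * b for a, b in zip(z, delta_m))  # z'_{t-1} delta_m
        u_i, u_m = ri - fit_i, rm - fit_m               # residuals as in (1.32)
        capm = u_m ** 2 * fit_i - u_m * u_i * fit_m     # term in braces in (1.36)
        for k in range(K):
            sums[k] += u_i * z[k]                       # first line of (1.34)
            sums[K + k] += u_m * z[k]                   # second line of (1.34)
            sums[2 * K + k] += capm * z[k]              # (1.36)
    return [s / T for s in sums]


random.seed(3)
T, beta = 50000, 1.5                      # hypothetical sample size and constant beta
delta_m0 = [0.5, 0.3]                     # hypothetical market mean coefficients
delta_i0 = [beta * d for d in delta_m0]   # implied by a constant beta
Z, r_i, r_m = [], [], []
for _ in range(T):
    z = [1.0, random.gauss(0.0, 1.0)]
    rm = sum(a * b for a, b in zip(z, delta_m0)) + random.gauss(0.0, 1.0)
    ri = beta * rm + random.gauss(0.0, 0.5)  # idiosyncratic noise unrelated to the market
    Z.append(z)
    r_m.append(rm)
    r_i.append(ri)

at_truth = capm_moments(delta_i0, delta_m0, r_i, r_m, Z)
at_wrong = capm_moments([0.0, 0.0], delta_m0, r_i, r_m, Z)
print(at_truth, at_wrong)
```

With two instruments this gives six moment conditions for one asset, all close to zero at the true parameter values and visibly violated when $\delta_{i,0}$ is replaced by zeros.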

This model can be estimated by Maximum Likelihood but, again, this ap- 
proach will encounter the problems mentioned in Section 1.1. The endogenous 
variables are $r_t = (r_{1,t}, r_{2,t}, \ldots, r_{17,t}, r_{m,t})'$. To
implement MLE the conditional distribution for $r_t$ must be specified so that
it satisfies both the conditional mean specification in (1.31) and the
relationship between the conditional means, conditional variances and
covariances in (1.35). Once again, the normal distribution is a natural first
choice, but just as in the mutual fund example, these asset returns do not
possess this distribution. Therefore, MLE under the assumption of normality is
not necessarily more precise than GMM although it should lead to unbiased
inferences provided the variances are correctly calculated.²⁸ MLE
would be also slightly more computationally burdensome than GMM due to 
the imposition on the likelihood of the restrictions between first and second 
moments implied by the conditional CAPM. 


1.3.4 Inventory Holdings by Firms 


A firm can choose to use its output to meet current demand or hold it as in- 
ventory. There is a considerable literature in macroeconomics which seeks to 
explain the level of inventory holdings in the aggregate economy; e.g. see the 
survey by Blinder and Maccini (1991). These studies typically proceed by mod- 
elling the sales and inventories of a particular industry as if they are the outcome 
of decisions made by a single representative firm. One popular line of theory is 
based on the assumption that the representative firm uses inventories to smooth 
production levels. Although intuitively reasonable, the production smoothing 
model has had mixed success in explaining aggregate inventory behaviour; see 
Blinder and Maccini (1991). One response to this evidence has been to argue 
that firms smooth production costs and not levels. To test if either of these 
hypotheses can explain the data it is desirable to perform inference within a 
model which allows both types of behaviour. Eichenbaum (1989) presents such 
a model and uses it to analyze the inventory holdings in a number of two digit 
SIC industries in the U.S. This section outlines Eichenbaum’s model. 

The representative firm is assumed to face two types of costs: production 
costs and inventory holding costs. The production costs are assumed to be: 


$$ C_{Q,t} = \nu_t Q_t + (a_0/2)\, Q_t^2 \tag{1.37} $$


where $Q_t$ is the firm’s output at time t and $\nu_t$ is a random variable capturing
stochastic shocks to the marginal cost of production. Since $\nu_t$ is random,
marginal costs are a random function and so there is an incentive for holding
inventories to smooth production costs. However, if $\nu_t = 0$ then marginal cost
is a deterministic function of output and so the only incentive for holding
inventories is to smooth the level of production. The constant $a_0$ controls the
slope of the marginal cost schedule: if $a_0$ is positive then the marginal costs
are increasing with output and if $a_0$ is negative then the marginal costs are


28 If the distribution is misspecified then in general the information matrix identity does 
not hold. This affects the formulae for the variances of the estimators; see White (1982) and 
Section 3.8. 




decreasing with output. The inventory holding costs are assumed to be

$$ C_{I,t} = (\delta_0/2)(I_t - \gamma_0 S_t)^2 + (\eta_0/2)\, I_t^2 \tag{1.38} $$


where $I_t$, $S_t$ are the inventories and sales of the firm at time t respectively.29 The
constants $\gamma_0$, $\delta_0$ and $\eta_0$ are all nonnegative. The first term in (1.38) captures the
cost to the firm of inventories deviating from the desired fraction of sales, $\gamma_0 S_t$.
The second term in (1.38) captures the storage costs associated with holding 
inventories. The combination of the production and inventory costs yields the 
total cost function of the firm: 


$$ C_t = C_{Q,t} + C_{I,t} \tag{1.39} $$


By definition, sales, inventories and production are fundamentally related
by: $Q_t = S_t + I_t - I_{t-1}$. Using this identity $Q_t$ can be explicitly eliminated from
the model. Therefore, the firm is assumed to choose $I_{t+1}$ and $S_{t+1}$ to maximize
future discounted profits, denoted


$$ E\Big[\sum_{j=0}^{\infty} \beta_0^{\,j}\,(p_{t+j} S_{t+j} - C_{t+j}) \,\Big|\, \Omega_t\Big] \tag{1.40} $$


where $p_t$ is the price in period t of the good produced by the firm, $\beta_0$ is the
discount factor and $\Omega_t$ is the firm’s information set at time t.
To characterize the optimal path for inventories and sales it is necessary
to make some assumption about the random variable $\nu_t$. Eichenbaum (1989)
assumes that

$$ \nu_t = \rho_0 \nu_{t-1} + \epsilon_t \tag{1.41} $$


where $E[\epsilon_t \mid \Omega_{t-1}] = 0$, $\mathrm{Var}[\epsilon_t \mid \Omega_{t-1}] < \infty$ and $|\rho_0| < 1$. In this case the Euler
equation implies the following condition:


$$ E[h_{t+2}(\psi_0) - \rho_0 h_{t+1}(\psi_0) \mid \Omega_t] = 0 \tag{1.42} $$


where


$$ h_{t+1}(\psi_0) = I_{t+1} - \{\lambda_0 + (\lambda_0\beta_0)^{-1}\}\, I_t + \beta_0^{-1} I_{t-1} + S_{t+1} - \phi_0\beta_0^{-1} S_t \tag{1.43} $$


and the parameters of the system are $\rho_0$ and the cost function parameters $\psi_0 =
(\lambda_0, \beta_0, \phi_0)$ where $\phi_0 = (1 - \delta_0\gamma_0/a_0)$ and $\lambda_0$ is a root from the second order
autoregressive polynomial governing the time series properties of the inventory
series; see Eichenbaum (1989) for details. Using a similar iterated expectations
argument as in (1.23), it can be shown that


$$ E[\{h_{t+2}(\psi_0) - \rho_0 h_{t+1}(\psi_0)\}\, z_t] = 0 \tag{1.44} $$


29 Eichenbaum includes a term $\eta_{1t} I_t$ where $\eta_{1t}$ is a parameter which depends on t. However
this parameter is argued to be eliminated by a data transformation prior to estimation. So
for expositional simplicity this parameter has been set to zero.

for any vector $z_t \in \Omega_t$. For example, Eichenbaum estimates the parameters
using the lagged values of inventories and sales, $\{S_{t-i}, I_{t-i};\ i = 1,2,\ldots,k\}$, in $z_t$.

Maximum Likelihood would involve estimation of the bivariate vector autoregressive
system for $(S_t, I_t)$ subject to the nonlinear cross equation restrictions on
the parameters implied by the model. This is likely to be more computationally 
burdensome with the exact degree depending on the choice of distribution. Un- 
fortunately, economic theory provides no guidance on this choice. Once again, 
unless the chosen distribution is correct then the resulting MLE’s are unlikely 
to have the anticipated optimal properties. 


1.3.5 Stochastic Volatility Models of Exchange Rates 


The preceding models have all been developed from economic theory. In some 
circumstances, it may be desired to capture the time series properties of an 
economic variable using a purely statistical model. An example of such a model 
would be the autoregressive integrated moving average (ARIMA) class devel- 
oped by Box and Jenkins (1976). However, ARIMA models are not particularly 
appropriate for many financial assets because they do not allow the conditional 
variance to change over time. This has led to considerable interest in statistical 
models which can capture this type of behaviour. The most prominent of these 
models are the autoregressive conditional heteroscedasticity (ARCH) models 
introduced by Engle (1982), which have been applied very widely in finance, 
see the survey by Bollerslev, Chou, and Kroner (1992). More recently, a sec- 
ond class is receiving considerable attention and these are known as stochastic 
volatility models; see the survey by Ghysels, Harvey, and Renault (1996). 

In this section we describe the stochastic volatility model used by Melino and 
Turnbull (1990) to analyze daily exchange rates. The model has its origins in a 
stochastic differential equation for the evolution of the exchange rate over time. 
However, we focus directly on the discrete time stochastic process which is used 
to approximate this underlying continuous time process. Let $y(\tau)$ denote the
exchange rate at time $\tau$ and assume that the exchange rate is observed at times
$\{\tau_1, \tau_2, \ldots, \tau_T\}$. These observations are not at evenly spaced intervals because
there are days on which no trading occurs, such as weekends and holidays. To
accommodate these effects, it is useful to denote the distance between observations
by $d_t = \tau_t - \tau_{t-1}$, and the minimum distance by $d = \min_t(d_t)$. The discrete
time approximation takes the form


$$
\begin{aligned}
y(\tau_t) &= \alpha_0 d_t + (1 + \beta_0 d_t)\, y(\tau_{t-1}) \\
&\quad + x(\tau_{t-1})\, y(\tau_{t-1})^{\gamma_0/2}\, d_t^{1/2}\, e(\tau_t)
\end{aligned}
\tag{1.45}
$$


where the latent process $x(\tau_t)$ is generated by


$$ \ln[x(\tau_t)] = \delta_0 d + (1 + \eta_0 d)\ln[x(\tau_t - d)] + \zeta_0 d^{1/2} u(\tau_t) \tag{1.46} $$


$$ \begin{bmatrix} e(\tau_t) \\ u(\tau_t) \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho_0 \\ \rho_0 & 1 \end{bmatrix} \right) \tag{1.47} $$

Given that the model includes a distributional assumption, it is natural to use
Maximum Likelihood. However, the evaluation of the conditional likelihood at
time t involves a T-dimensional numerical integration which is computationally
extremely burdensome, if not infeasible, on many currently available computer
systems. Nevertheless, the normality assumption implies various population moment
conditions which can form the basis of GMM estimation of the parameter vector
$\theta_0 = (\alpha_0, \beta_0, \delta_0, \eta_0, \zeta_0, \rho_0)$.30 For example, Melino and Turnbull (1990) show
that the following population moment conditions hold:31


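$$
\begin{aligned}
E[w_t(\theta_0)] &= 0 \\
E[w_t^2(\theta_0)] - \exp[2\mu_x + 2\sigma_x^2] &= 0 \\
E[w_t^3(\theta_0)] &= 0 \\
E[w_t^4(\theta_0)] - 3\exp[4\mu_x + 8\sigma_x^2] &= 0 \\
E[|w_t(\theta_0)|] - (2/\pi)^{1/2}\exp[\mu_x + 0.5\sigma_x^2] &= 0 \\
E[|w_t(\theta_0)|^3] - 2(2/\pi)^{1/2}\exp[3\mu_x + 4.5\sigma_x^2] &= 0 \\
E[|w_t(\theta_0)|\, w_t(\theta_0)] &= 0 \\
E[w_t(\theta_0)\, w_{t-j}(\theta_0)] &= 0 \\
E[|w_t(\theta_0)\, w_{t-j}(\theta_0)|] - \ell_{1,j}(\theta_0) + \ell_{2,j}(\theta_0) &= 0 \\
E[|w_t(\theta_0)|\, w_{t-j}(\theta_0)] - m_j(\theta_0) &= 0 \\
E[w_t^2(\theta_0)\, w_{t-j}^2(\theta_0)] - n_j(\theta_0) &= 0
\end{aligned}
\tag{1.48}
$$

for j = 1,2,... where

$$ w_t(\theta_0) = \frac{y(\tau_t) - \alpha_0 d_t - (1 + \beta_0 d_t)\, y(\tau_{t-1})}{y(\tau_{t-1})^{\gamma_0/2}\, d_t^{1/2}} $$

and

$$
\begin{aligned}
\ell_{1,j}(\theta_0) &= (2/\pi)\exp\big[2\mu_x + \sigma_x^2(1 + (1 + \eta_0 d)^j) - 0.5\rho_0^2\zeta_0^2 d\,(1 + \eta_0 d)^{2j-2}\big] \\
\ell_{2,j}(\theta_0) &= (2/\pi)^{1/2}\rho_0\zeta_0 d^{1/2}(1 + \eta_0 d)^{j-1}\big(1 - 2\Phi(\rho_0\zeta_0 d^{1/2}(1 + \eta_0 d)^{j-1})\big) \\
&\quad \times \exp\big[2\mu_x + \sigma_x^2(1 + (1 + \eta_0 d)^j)\big] \\
m_j(\theta_0) &= (2/\pi)^{1/2}\rho_0\zeta_0 d^{1/2}(1 + \eta_0 d)^{j-1}\exp\big[2\mu_x + \sigma_x^2(1 + (1 + \eta_0 d)^j)\big] \\
n_j(\theta_0) &= \{4\rho_0^2\zeta_0^2 d\,(1 + \eta_0 d)^{2j-2} + 1\}\exp\big[4\mu_x + 4\sigma_x^2(1 + (1 + \eta_0 d)^j)\big] \\
\mu_x &= -\delta_0/\eta_0 \\
\sigma_x^2 &= \zeta_0^2 d / [1 - (1 + \eta_0 d)^2]
\end{aligned}
$$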

and ®(.) denotes the cumulative distribution function of a standard normal 
random variable. 
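
To give a feel for how such conditions can be checked numerically, the following Python sketch simulates the latent process in (1.46) and evaluates the sample counterparts of two of the conditions in (1.48), namely those for $E[w_t^2]$ and $E[|w_t|]$. It rests on several assumptions that go beyond the text: the observations are taken to be evenly spaced (so that $d_t = d$ and $w_t = x(\tau_{t-1})e(\tau_t)$), the parameter values are hypothetical, and the function names are invented for the illustration.

```python
import math
import random


def sv_stationary_moments(delta0, eta0, zeta0, d):
    """Mean and variance of ln x implied by (1.46); requires |1 + eta0*d| < 1."""
    mu_x = -delta0 / eta0
    sigma2_x = zeta0 ** 2 * d / (1.0 - (1.0 + eta0 * d) ** 2)
    return mu_x, sigma2_x


def simulate_w(T, delta0, eta0, zeta0, rho0, d, seed=0):
    """Simulate w_t = x(tau_{t-1}) e(tau_t) with evenly spaced observations."""
    rng = random.Random(seed)
    mu_x, _ = sv_stationary_moments(delta0, eta0, zeta0, d)
    ln_x = mu_x  # start the latent process at its stationary mean
    w = []
    for _ in range(T):
        e = rng.gauss(0.0, 1.0)
        # u has unit variance and correlation rho0 with e
        u = rho0 * e + math.sqrt(1.0 - rho0 ** 2) * rng.gauss(0.0, 1.0)
        w.append(math.exp(ln_x) * e)  # uses x dated one period earlier
        ln_x = delta0 * d + (1.0 + eta0 * d) * ln_x + zeta0 * math.sqrt(d) * u
    return w


# Hypothetical parameter values
delta0, eta0, zeta0, rho0, d = -0.05, -0.1, 0.1, -0.3, 1.0
mu_x, sigma2_x = sv_stationary_moments(delta0, eta0, zeta0, d)
w = simulate_w(200_000, delta0, eta0, zeta0, rho0, d)

# Sample counterparts of the conditions for E[w^2] and E[|w|] in (1.48)
g2 = sum(v * v for v in w) / len(w) - math.exp(2 * mu_x + 2 * sigma2_x)
g5 = (sum(abs(v) for v in w) / len(w)
      - math.sqrt(2 / math.pi) * math.exp(mu_x + 0.5 * sigma2_x))
```

At the parameter values used to generate the data both `g2` and `g5` should be close to zero in a large sample; in estimation the roles are reversed, with the data fixed and the parameters chosen to bring such sample moments close to zero.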


30 In their estimations, Melino and Turnbull (1990) fix the value of $\gamma_0$ and so we omit this
term from $\theta_0$. See Section 9.4 for further discussion of this issue.

31 These expressions are not actually presented in the published version of Melino and 
Turnbull’s paper but are contained in an unpublished appendix by Ken Vetzal which was 
kindly sent to the author by Angelo Melino. 




1.4 Review of Statistical Theory 


To develop the theory of GMM estimators it is necessary to appeal to various 
statistical concepts and results. This section briefly reviews some basic ideas 
which are used throughout the text; other results are explained as they be- 
come needed. A more thorough review of these topics can be found in many 
econometric or statistical texts such as Davidson and MacKinnon (1993), Fuller 
(1976), Judge, Griffiths, Hill, Lutkepohl, and Lee (1985), and, for more rigorous 
treatments, Davidson (1994) and White (1984). All the results are based on 
asymptotic, or in other words, large sample theory. In the majority of our anal- 
ysis, this type of analysis involves an examination of what happens to various 
statistics as the sample size, T, tends to infinity. Asymptotic is the adjective 
derived from “asymptote”, the noun for the line which acts as a limit for a 
curve. According to the American Heritage Dictionary, asymptote comes from 
the Greek “asumptotos” in which “a” means not, “sun” means together and, 
“ptotos” means likely to fall. In spite of these unpromising origins, asymptotic 
analysis is used to approximate the behaviour of statistics in large, but finite, 
samples. An important secondary issue is the accuracy of this approximation 
and this is discussed in detail in Chapter 6. 

Before reviewing this theory, it is useful to emphasize an item of notation. In
the preceding sections, it has been shown that statistical or economic models
imply a set of population moment conditions involving the parameters and the
data. It is important to realize that these moment conditions only hold at the
true value of the parameters. A zero subscript is used to emphasize the true
value of the parameter vector. This notation is necessary to avoid ambiguity in
the formal discussion of statistical estimation. As we have seen in Section 1.2,
GMM estimation involves finding the value of the parameters which minimizes
$Q_T(\theta)$ given in (1.18). Formally, this will involve considering the behaviour of
$Q_T(\theta)$ over a set of possible values for $\theta$, known as the parameter space and
denoted $\Theta$. The notation $\theta$ is reserved to refer to an arbitrary element of $\Theta$.
As above, the notation $\hat{\theta}_T$ is used to denote the parameter estimator based on
a sample of size T. Both $\theta_0$ and $\hat{\theta}_T$ are individual elements of $\Theta$.

The IV estimator in (1.14), $\hat{\alpha}_T$, can be used to illustrate several key features
of asymptotic analysis of GMM estimators. It is of interest to analyze what
happens to $\hat{\alpha}_T$ as $T \to \infty$ and for this we require the concept of convergence in
probability. This analysis is facilitated by analyzing the limiting behaviour of
the sums in the numerator and denominator separately using the Weak Law of
Large Numbers and then taking the ratio of these limits to deduce the limiting
behaviour of $\hat{\alpha}_T$. This last step can be justified using Slutsky’s Theorem. In
particular, it is of interest to examine whether the estimator converges in
probability to the true population value of that coefficient; if so, then it is said to be
consistent. For the purposes of constructing confidence intervals and hypothesis
tests about $\alpha_0$, it is necessary to find some transformation of $\hat{\alpha}_T$ which
converges in distribution to a known probability distribution. For our purposes the
appropriate transformation is $T^{1/2}(\hat{\alpha}_T - \alpha_0)$ and this statistic can be shown to
converge to a normal distribution as $T \to \infty$ using the Central Limit Theorem.
In the remainder of this section these and certain other statistical concepts are
defined more formally. It is most convenient to split the discussion into two
parts. The first part deals with the properties of random sequences such as
convergence in probability or distribution which can be discussed in abstract.
The second part deals with results such as the Weak Law of Large Numbers
and Central Limit Theorem for which it is necessary to place restrictions on
the nature of the random variables in the model.
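
The separate treatment of the numerator and denominator can be seen in a small simulation. The Python sketch below is only an illustration under assumed conditions: it uses a hypothetical data generating process with a single regressor, a single instrument and a true coefficient of -1, and it assumes the estimator has the ratio form (sum of instrument times dependent variable) over (sum of instrument times regressor). None of the variable names or numerical values come from the text.

```python
import random


def iv_estimate(T, alpha0=-1.0, seed=0):
    """Simulate T observations; return (numerator/T, denominator/T, estimate)."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(T):
        z = rng.gauss(0.0, 1.0)                 # instrument
        u = rng.gauss(0.0, 1.0)                 # structural error
        x = z + 0.5 * u + rng.gauss(0.0, 1.0)   # regressor, correlated with u
        y = alpha0 * x + u
        num += z * y / T   # sample average of z_t * y_t
        den += z * x / T   # sample average of z_t * x_t
    return num, den, num / den


for T in (100, 10_000, 200_000):
    num, den, est = iv_estimate(T)
    print(T, round(num, 3), round(den, 3), round(est, 3))
```

In this design the population values of the two averages are -1 and 1, so as T grows the printed averages should settle near those values and their ratio near the true coefficient.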


1.4.1 Properties of Random Sequences 


To fix ideas, consider the case where the sequence is deterministic and so not
random. Let $\{h_T;\ T = 1,2,\ldots\}$ be a sequence of real numbers. If this sequence
has a limit, h, then this is denoted by


$$ \lim_{T \to \infty} h_T = h $$


This implies that for every $\epsilon > 0$ there is a positive, finite integer $T_\epsilon$ such that

$$ |h_T - h| < \epsilon \quad \text{for } T > T_\epsilon \tag{1.50} $$


Note (1.50) does not imply $|h_T - h|$ becomes monotonically smaller as T
increases. However, it does tell us that $|h_T - h|$ is smaller than $\epsilon$ for all $T > T_\epsilon$,
and so conveys a sense in which $h_T$ is becoming closer to h as T tends to infinity.
Often, it is useful to characterize the behaviour of a sequence with respect to
T regardless of whether it converges or not. This can be achieved using large
and small orders of magnitude. The sequence is said to be of large order of
magnitude $c_T$ if there exists a real number m such that $|h_T|/c_T < m$ for all
T. This is denoted by $h_T = O(c_T)$. The sequence is said to be of small order
of magnitude $c_T$ if the limit of $h_T/c_T$ is zero as $T \to \infty$. This is denoted by
$h_T = o(c_T)$.
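
To make the two orders of magnitude concrete, consider the following worked example. The particular sequence is chosen purely for illustration and does not appear elsewhere in the text.

$$ h_T = 3 + \frac{2}{T}, \qquad T = 1, 2, \ldots $$

Since $|h_T| \le 5 < 6$ for all T, $h_T = O(1)$. The deviation from the limit satisfies

$$ \frac{|h_T - 3|}{T^{-1}} = 2 < 3 \quad \text{for all } T, \qquad \frac{h_T - 3}{T^{-1/2}} = \frac{2}{T^{1/2}} \to 0 \quad \text{as } T \to \infty $$

so $h_T - 3 = O(T^{-1})$ and $h_T - 3 = o(T^{-1/2})$. However, $h_T - 3$ is not $o(T^{-1})$ because the ratio $(h_T - 3)/T^{-1}$ equals 2 for every T and so does not tend to zero.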

In these definitions, the deterministic nature of the sequence is reflected in
the way it can be stated with certainty that $h_T$ satisfies the property in question.
With sequences of random variables it is necessary to attach a probability to
such events occurring. This leads us to the concept of convergence in probability.
For notational convenience the results are also stated in terms of “$h_T$” but this
is now a random variable.


Definition 1.3 Convergence in Probability
The sequence of random variables $\{h_T\}$ converges in probability to the random
variable h if for all $\epsilon > 0$


$$ \lim_{T \to \infty} P[|h_T - h| < \epsilon] = 1 $$


In this case h is known as the probability limit or plim of $h_T$ and is denoted by
$\mathrm{plim}\, h_T = h$ or $h_T \overset{p}{\to} h$.

The definition of convergence in probability implies that for each $\epsilon > 0$ there
exists a finite $T_\epsilon$ such that the probability of $|h_T - h| < \epsilon$ is arbitrarily close
to one for all $T > T_\epsilon$. So convergence in probability can be recognized as the
natural extension of the concept of convergence for deterministic sequences.
The concepts of order of magnitude can be similarly extended to sequences of
random variables.
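
The definition can be illustrated by simulation. In the Python sketch below, which is an illustration only, $h_T$ is taken to be the mean of T independent draws from the uniform distribution on (0, 1), so that h = 0.5; the choice of distribution, the value of $\epsilon$, the number of replications and the function name are all assumptions made for the example.

```python
import random


def prob_within_eps(T, eps=0.05, reps=2000, seed=0):
    """Monte Carlo estimate of P[|h_T - h| < eps] for h_T a mean of T uniforms."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        h_T = sum(rng.random() for _ in range(T)) / T
        hits += abs(h_T - 0.5) < eps   # True counts as 1
    return hits / reps


# Estimated probabilities for increasing sample sizes
probs = [prob_within_eps(T) for T in (10, 100, 1000)]
```

The estimated probabilities should rise towards one as T increases, which is the behaviour Definition 1.3 describes for a fixed $\epsilon$.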


Definition 1.4 Orders in Probability


1. The sequence of random variables $\{h_T\}$ is said to be of large order in
probability $c_T$ if for every $\epsilon > 0$ there exist positive real numbers $m_\epsilon$ and
$T_\epsilon$ such that $P[|h_T|/c_T > m_\epsilon] < \epsilon$ for all $T > T_\epsilon$. This is denoted by
$h_T = O_p(c_T)$.


2. The sequence of random variables $\{h_T\}$ is said to be of small order in
probability $c_T$ if $\mathrm{plim}(h_T/c_T) = 0$. This is denoted by $h_T = o_p(c_T)$.


Both types of order in probability are very useful in asymptotic analysis because
they can be linked to consistency and convergence in distribution as will be
shown below. However, first it is necessary to extend the notion of convergence
in probability to vectors and matrices. A vector (or matrix), $h_T$, is said to
converge in probability to h if the ith (or (i, j)th) element of $h_T$ converges in
probability to the ith (or (i, j)th) element of h for all i (or (i, j)). The extension
of orders in probability is a little more tricky because in general there is no
guarantee that all elements of a random vector or matrix are of the same order.
However, in the majority of our analysis this will be the case and so we use the
notation $h_T = O_p(c_T)$ or $h_T = o_p(c_T)$ to indicate that all the elements of the
vector or matrix individually satisfy the stated order in probability.

In many cases, our analysis involves the probability limits of functions of 
random vectors and so the following result is going to be very useful. For 
convenience the result is stated in terms of random vectors; however, the same 
result applies for random variables and random matrices. 


Lemma 1.1 Slutsky’s Theorem32


Let $\{h_T\}$ be a sequence of random vectors which converges in probability to
the random vector h and let f(.) be a vector of continuous functions then


$$ \mathrm{plim}\, f(h_T) = f(h). $$
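
As a simple illustration, consider an estimator that is a ratio of two sample averages, as in the IV case discussed earlier in this section. The symbols $a_T$, $b_T$, a and b are introduced here only for this sketch, and $y_t$, $x_t$ and $z_t$ stand for a generic dependent variable, regressor and instrument.

$$ a_T = T^{-1}\sum_{t=1}^{T} z_t y_t, \qquad b_T = T^{-1}\sum_{t=1}^{T} z_t x_t, \qquad \mathrm{plim}\, a_T = a, \qquad \mathrm{plim}\, b_T = b \neq 0 $$

The function $f(a, b) = a/b$ is continuous at every point with $b \neq 0$, and so applying the lemma with $h_T = (a_T, b_T)'$ gives

$$ \mathrm{plim}\, \frac{a_T}{b_T} = \frac{\mathrm{plim}\, a_T}{\mathrm{plim}\, b_T} = \frac{a}{b} $$

The requirement $b \neq 0$ matters: if the denominator converged to zero then the ratio would not be continuous at the limit point and the argument would break down.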


In many cases $h_T = \hat{\theta}_T$, a GMM estimator of some unknown parameter vector
$\theta_0$, and so it is of interest to characterize the limiting relationship between
estimator and estimand.


Definition 1.5 Consistency of an Estimator
Let $\{\hat{\theta}_T\}$ be a sequence of estimators of the unknown parameter vector of
constants $\theta_0$ then $\hat{\theta}_T$ is said to be a consistent estimator of $\theta_0$ if $\mathrm{plim}\, \hat{\theta}_T = \theta_0$.


32 This theorem is named after Evgenii Slutsky (1880-1948), a Russian mathematician 
who first proved a version of this result. He made numerous other contributions to statistics 
including early work which helped to lay the foundations of stationary time series theory. He 
also made contributions to economics particularly in the area of demand analysis including 
the eponymous Slutsky effect and Slutsky matrix.

If $\mathrm{plim}\, \hat{\theta}_T \neq \theta_0$ then the estimator is said to be inconsistent. Notice that the
consistency of $\hat{\theta}_T$ for $\theta_0$ implies $\hat{\theta}_T - \theta_0 = o_p(1)$. Consistency is a rather weak
property because it merely states that as $T \to \infty$ the estimator converges in
probability to the true value. It is perfectly reasonable to question how much
comfort can be drawn from this property since it implies the true value is only
recovered in the limit. However, earlier it was observed that convergence also
implies a sense in which $\hat{\theta}_T$ becomes closer to $\theta_0$ as T increases. This is a more
intuitively appealing property; certainly we would be concerned if the estimator
is inconsistent and so not converging in probability to the true value!

Convergence in probability implies that the difference between $\hat{\theta}_T$ and $\theta_0$
disappears with probability one as $T \to \infty$. Therefore in the limit $\hat{\theta}_T$ and $\theta_0$
are essentially identical. In deriving the asymptotic distribution of the GMM
estimator, it will be convenient to appeal to the weaker notion of convergence in
distribution. For this definition we revert to the more general notation because
this concept is not just applied to estimators in our analysis.


Definition 1.6 Convergence in Distribution

The sequence of random vectors $\{h_T\}$ with corresponding distribution functions
$\{F_T(c)\}$ converges in distribution to the random vector h with distribution
function F(c) if and only if, for every $\epsilon > 0$, there exists $T_\epsilon$ such that $|F_T(c) - F(c)| < \epsilon$
for $T > T_\epsilon$ at all points of continuity {c} of F(c). This is denoted by $h_T \overset{d}{\to} h$.


The distribution of h is known as the limiting (or asymptotic) distribution of
$h_T$. If $h_T$ converges in distribution then $h_T = O_p(1)$. However, in practice,
our focus is not just on establishing that $h_T$ converges in distribution, but also
on characterizing the exact nature of its limiting distribution. We now turn to
various results which facilitate this type of analysis as well as the other aspects
of asymptotic behaviour described above.


1.4.2 Stationary Time Series, the Weak Law of Large Numbers
and the Central Limit Theorem


The asymptotic theory in this book revolves around analyses of the limiting 
behaviour of sums of random variables using the Weak Law of Large Numbers 
and Central Limit Theorem. For these results to apply, it is necessary to place 
restrictions on the nature of the random variables in the model. Various ap- 
proaches can be taken but, throughout this book, we follow Hansen’s (1982) 
original treatment involving stationary time series. In passing we note that this 
assumption is employed in nearly all the studies listed in Table 1.1.33 


Definition 1.7 Strictly Stationary Processes
Let $N(T) = \{1,2,\ldots,T\}$ and $\{v_t;\ t \in N(T)\}$ be a set of random vectors. Define
$\{t_1, t_2, \ldots, t_n\}$ to be a subset of N(T). The set of random vectors are said to


33 See Appendix A for a brief discussion of the GMM framework under alternative
assumptions about the data generation process.

be strictly stationary if the joint probability distribution function, F(.), of any
subset of $\{v_t\}$ satisfies:


$$ F(v_{t_1}, v_{t_2}, \ldots, v_{t_n}) = F(v_{t_1+c}, v_{t_2+c}, \ldots, v_{t_n+c}) $$


for any integer n and integer constant c such that $\{t_1+c, t_2+c, \ldots, t_n+c\}$ is a
subset of N(T).


One consequence of this definition is that all moments of the process are 
constant over time, provided they exist. The imposition of strict stationarity 
is insufficient by itself to permit the proof of the Weak Law of Large Numbers and
the Central Limit Theorem. In addition restrictions need to be placed on the 
dependence structure and certain higher moments of the series. Examples of 
such conditions on the dependency are ergodicity or various types of mixing 
condition. Both involve rather sophisticated mathematical ideas and so for the 
present, we just add the caveat “subject to certain regularity conditions” in the 
statement of the following results. However, we return to these conditions in 
Chapter 3. 


Lemma 1.2 Weak Law of Large Numbers (WLLN)
Let $\{v_t;\ t = 1,2,\ldots,T\}$ be a sequence of strictly stationary random vectors with
$E[v_t] = \mu$ then subject to certain regularity conditions

$$ T^{-1}\sum_{t=1}^{T} v_t \overset{p}{\to} \mu $$


Lemma 1.3 Central Limit Theorem (CLT)
Let $\{v_t;\ t = 1,2,\ldots,T\}$ be a sequence of strictly stationary (s x 1) random vectors
with $E[v_t] = \mu$ then subject to certain regularity conditions


$$ T^{-1/2}\sum_{t=1}^{T} (v_t - \mu) \overset{d}{\to} N(0, \Sigma) $$


where $N(0, \Sigma)$ denotes the s dimensional multivariate normal distribution with
mean 0 and positive definite covariance matrix


$$ \Sigma = \lim_{T \to \infty} \mathrm{Var}\Big[T^{-1/2}\sum_{t=1}^{T} (v_t - \mu)\Big] $$

The matrix $\Sigma$ is known as the long run covariance matrix of $v_t$ to distinguish it
from the contemporaneous covariance matrix $E[(v_t - \mu)(v_t - \mu)']$.
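
The distinction between the two matrices can be made concrete with a scalar example. The Python sketch below is an illustration under an assumed model that is not taken from the text: a stationary first order autoregression with coefficient `phi` and unit innovation variance, for which the autocovariance at lag j is `phi**j` times the contemporaneous variance. It computes the variance of the scaled sum for several sample sizes and compares it with the limiting value.

```python
def scaled_sum_variance(T, phi, sigma2=1.0):
    """Var[T^{-1/2} * sum_{t=1}^T v_t] for a stationary AR(1) with coefficient phi."""
    gamma0 = sigma2 / (1.0 - phi ** 2)   # contemporaneous variance
    total = gamma0
    for j in range(1, T):
        # each lag-j autocovariance appears 2*(T - j) times in the variance of the sum
        total += 2.0 * (1.0 - j / T) * gamma0 * phi ** j
    return total


phi = 0.5
contemporaneous = 1.0 / (1.0 - phi ** 2)   # variance of v_t
long_run = 1.0 / (1.0 - phi) ** 2          # limit of the scaled-sum variance
finite_T = [scaled_sum_variance(T, phi) for T in (10, 100, 10_000)]
```

With positive serial correlation the values in `finite_T` rise towards `long_run`, which here is three times the contemporaneous variance; ignoring the serial dependence would therefore understate the variance relevant for the Central Limit Theorem.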

To conclude this section, it is useful to present one final result which is 
invoked frequently.34


34 This result is proved in Fuller (1976, p.199).

Lemma 1.4 The Limiting Distribution of Random Linear Functions
of Vectors Converging to a Normal Distribution

Let $\{M_T;\ T = 1,2,\ldots\}$ be a sequence of random matrices which converges in
probability to M, a matrix of constants, and $\{h_T;\ T = 1,2,\ldots\}$ be a sequence
of random vectors which converges in distribution to a $N(0, \Sigma)$ distribution then


$$ M_T h_T \overset{d}{\to} N(0, M\Sigma M') $$


1.5 Overview of Later Chapters 


This chapter has provided the flavour of GMM and placed the technique in the 
context of both the econometrics and statistics literatures. In the next chapter, 
we introduce the key elements of the GMM framework using the IV estima- 
tor in the static linear model. This approach keeps the technical details to a 
minimum and allows the reader to appreciate more readily the main ideas and 
intuitions. The issues addressed here are: identification; the asymptotic proper- 
ties of the estimator; the iterated GMM estimator; and a decomposition of the 
population moment conditions into identifying and over-identifying restrictions 
which leads to the overidentifying restrictions test amongst other things. The 
following chapters build from these foundations to present the GMM framework 
for estimation and inference which encompasses the majority of the models in 
Table 1.1. 

Chapter 3 addresses GMM estimation and the asymptotic properties of the 
estimator in correctly specified nonlinear dynamic models. The topics cov- 
ered are: identification, calculation of the estimator by numerical optimization 
routines, consistency, asymptotic normality, covariance matrix estimation and 
iterated GMM estimators. Formal proofs are presented for the main statistical 
results. However, the issues are also illustrated using the consumption based as- 
set pricing model to provide guidance on the practical implementation of GMM 
as well. All this discussion takes the data, parameter vector and population 
moment condition as given. In some cases, the researcher may desire to impose 
a normalization on any one of these three features. Therefore, the impact of 
normalization is also discussed and this motivates a variant of GMM known as 
the continuous updating estimator. This chapter concludes with a more formal 
presentation of how many seemingly different estimators can be regarded as 
special cases of GMM. 

Chapter 4 explores the consequence of misspecification for the statistical 
properties of the GMM estimator. Particular attention is focused on convergence 
in probability of the estimator, covariance matrix estimation and the limiting
distribution of the estimator. A comparison with the results in the previous 
chapter reveals that misspecification has a fundamental impact on the large 
sample behaviour of the GMM estimator and its associated statistics. These 
differences motivate the use of the model specification tests. 

Chapter 5 examines a wide variety of hypothesis tests which have been proposed
within the GMM framework. The main focus is on the following: the
overidentifying restrictions test, tests for the validity of a subset of population
moment conditions, tests of whether the parameter vector satisfies a set of
restrictions, and structural stability tests. However, there is also some discussion
of Hausman-type tests, non-nested hypothesis tests and conditional moment
tests.

All the preceding analysis is based on asymptotic theory. Chapter 6 ex- 
plores how well this theory approximates finite sample behaviour. If attention 
is reduced to a very specific class of models then it is possible to examine this 
question analytically. However, for more general specifications, it is necessary to 
resort to computer based simulation studies. Both approaches are reviewed in 
Chapter 6, and the results from each are synthesized to indicate what aspects of 
the specification appear to affect the quality of the asymptotic approximation to
finite sample behaviour. This chapter begins with a discussion of the available 
asymptotic results on the consequences of increasing the number of the moment 
conditions upon which the estimation is based. 

The asymptotic theory in Chapters 3 and 5 takes the population moment 
condition as given. However, the evidence reviewed in Chapter 6 indicates 
that the quality of the asymptotic approximation can be sensitive to the choice 
of moment condition. Chapter 7 reviews the literature on moment selection. 
The discussion falls into two parts. The first part summarizes available results 
on the optimal choice of instrument in the special case of GMM known as 
generalized instrumental variables (GIV). The second part describes a number 
of information criteria that have been proposed as a basis for moment selection. 

In the face of evidence that the asymptotic theory from Chapters 3 and 5 can 
provide a poor approximation, it is natural to seek alternative approximations 
that permit more reliable inference. Three such approximations are reviewed in 
Chapter 8. These are: the use of the bootstrap, an asymptotic theory derived 
under the assumption that the population moment condition provides weak 
identification, and an asymptotic theory for the case in which the long run 
variance is estimated by a class of estimators that are random in the limit. 

All the methods and issues described above are illustrated empirically using 
the consumption based asset pricing model in Section 1.3.1. Chapter 9 presents 
empirical results for the other four examples in Section 1.3 that illustrate various 
aspects of the GMM inference framework. 

Finally, Chapter 10 briefly reviews some other estimation techniques that 
are closely related to GMM. These are Simulated Method of Moments, Indirect 
Inference, Efficient Method of Moments and the method of Empirical Likelihood. 


2 


The Instrumental Variable 
Estimator in the Linear 
Regression Model 


One of the main advantages of GMM is that it can be used to perform inference 
about the parameters in nonlinear dynamic models. However, as might be 
anticipated, both nonlinearity and dynamics create a number of technical issues 
which need to be addressed in the statistical analysis. These issues can obscure 
the essential structure of the method for those readers less familiar with this 
type of analysis. Therefore, in this chapter, we introduce the key elements of 
the GMM framework using the IV estimator in the static linear model. This 
approach enables us to keep the technical details to a minimum and allows the 
reader to appreciate more readily the main ideas and intuitions. Those readers 
already familiar with the basic GMM framework may prefer to pass over this 
chapter. 


Section 2.1 specifies the model and discusses the connections between the 
population moment condition and the condition for parameter identification. 
Section 2.2 derives the estimator and describes a fundamental decomposition of 
the population moment condition into “identifying” and “overidentifying” re- 
strictions. Section 2.3 considers the asymptotic properties of the estimator and 
the estimated sample moment. In the course of this discussion, it emerges that 
a consistent estimator of the long run variance of the sample moment is required 
for inference procedures based on the parameters or estimated moments. There- 
fore, Section 2.3 also contains a brief discussion of how such a covariance matrix 
estimator can be constructed in this simple model. Section 2.4 examines the 
optimal way in which to use the information in the population moment condi- 
tions, and introduces the “two step” and iterated GMM estimators. Section 2.5 
discusses the consequences of specification error, and introduces the overidentifying
restrictions test statistic which is the standard diagnostic for the model
specification within the GMM framework. Section 2.6 contains a summary of
the chapter.


2.1 The Population Moment Condition and 
Parameter Identification 


Consider the linear regression model

$$ y_t = x_t'\theta_0 + u_t, \qquad t = 1, 2, \ldots, T \tag{2.1} $$


in which $x_t$ is a (p x 1) vector of observed explanatory variables for the observed
variable $y_t$, and $u_t$ is the unobserved error term. The (p x 1) vector $\theta_0$ is an
element of the parameter space, $\Theta$, a subset of the p-dimensional Euclidean
space $\mathbb{R}^p$. The instruments are contained in the (q x 1) vector $z_t$. To facilitate
the discussion, it is useful to define: $u_t(\theta) = y_t - x_t'\theta$. Notice that $u_t(\theta_0) = u_t$.
As the analysis progresses, certain restrictions need to be placed on the variables
but these will only be imposed as they become necessary to emphasize their role.
At this stage, we only require the following.


Assumption 2.1 Strict Stationarity 
The random vector $v_t = (x_t', z_t', u_t)'$ is a strictly stationary process.


This assumption implies that any population moments of $v_t$ are independent of
$t$.

Estimation of $\theta_0$ is based on the following population moment condition.


Assumption 2.2 Population Moment Condition 
The $(q \times 1)$ vector $z_t$ satisfies: $E[z_t u_t(\theta_0)] = 0$.


This type of condition is sometimes referred to as an “orthogonality condition”
because it states that $z_t$ is statistically orthogonal to $u_t$. At this stage, it may be
useful to relate this structure back to one of the models encountered in Chapter 
1. 


Example: Wright’s (1925) Demand Equation 

It can be recalled from Section 1.2 that Wright (1925) proposed IV as a method
for estimating the parameters of demand and supply equations. His original
derivation was based on the Method of Moments principle and so its implementation
only required the researcher to find one instrument $z_t$ which satisfied the
moment condition in (1.13). Two candidates were suggested: an input price,
now denoted $z_{1,t}$, and yield per acre, $z_{2,t}$. However, rather than choose between
these two instruments arbitrarily, intuition suggests that a far more appealing
strategy is to base estimation on both. This leads to the $(2 \times 1)$ population
moment condition

$$E[z_t(q_t - \alpha_0 p_t)] = 0$$

where $z_t = (z_{1,t}, z_{2,t})'$. It can be recognized that this population moment
condition fits within the framework of Assumption 2.2 once $q_t$, $p_t$ and $\alpha_0$ are
substituted for $y_t$, $x_t$ and $\theta_0$ respectively in (2.1).


While Assumption 2.2 specifies the information upon which estimation is
based, the resulting estimation is only going to be successful if this population
moment condition provides enough information to determine $\theta_0$ uniquely. In
reality, this is not guaranteed to be the case. The parameter vector $\theta_0$ is only
uniquely determined by the moment condition if $E[z_t u_t(\theta)] \neq 0$ at all other
values of $\theta$. In this case $\theta_0$ is said to be identified by the population moment
condition. This condition is easily stated but, in this form, provides little
guidance about the circumstances under which it holds. Fortunately, it is possible to
obtain a more transparent version. Since $u_t(\theta) = u_t(\theta_0) + x_t'(\theta_0 - \theta)$,
it follows that

$$E[z_t u_t(\theta)] = E[z_t u_t(\theta_0)] + E[z_t x_t'](\theta_0 - \theta) \tag{2.2}$$

and this combined with the population moment condition implies

$$E[z_t u_t(\theta)] = E[z_t x_t'](\theta_0 - \theta) \tag{2.3}$$

Therefore $\theta_0$ is identified if $E[z_t x_t'](\theta_0 - \theta) \neq 0$ for all $\theta \neq \theta_0$. Equation (2.3)
is a system of linear equations in $\theta_0 - \theta$ and so this property is guaranteed if
the rank of $E[z_t x_t']$ is $p$; for example see Strang (1988, p.96). This gives the
following condition for identification.


Assumption 2.3 Identification Condition 
$\mathrm{rank}\{E[z_t x_t']\} = p$.

The population moment and identification conditions provide the essential
information upon which estimation of $\theta_0$ is based. In view of its fundamental
importance, it is worth briefly pausing to reflect on the exact nature of this
information. Assumptions 2.2 and 2.3 imply there is a unique value in the
parameter space at which $E[z_t u_t(\theta)]$ equals zero. In our discussion we have
denoted this value by $\theta_0$; however, nothing has been said about this value
beyond its uniqueness.

Before proceeding to define the GMM estimator, it is worth briefly
considering how parameter identification can fail. There are two basic scenarios.
First, failure can occur because there are fewer moment conditions than
parameters. In terms of the mathematics, this implies that $\mathrm{rank}(E[z_t x_t']) \leq q < p$.
Intuitively, the problem here is that it is impossible to extract the $p$ pieces of
information needed to determine $\theta_0$ uniquely from fewer than $p$ population moment
conditions. Secondly, failure can occur even when $q \geq p$ because collectively
the population moment conditions still do not provide enough information to
uniquely determine $\theta_0$. This second scenario is best understood by considering
a simple example. Suppose $p = q = 2$; let $x_t = (x_{1,t}, x_{2,t})'$, $z_t = (z_{1,t}, z_{2,t})'$ and
$\theta_i$, $\theta_{0,i}$ denote the $i$th elements of $\theta$, $\theta_0$ respectively. In this case,

$$E[z_t u_t(\theta)] = \begin{bmatrix} E[z_{1,t}x_{1,t}] & E[z_{1,t}x_{2,t}] \\ E[z_{2,t}x_{1,t}] & E[z_{2,t}x_{2,t}] \end{bmatrix} \begin{bmatrix} \theta_{0,1} - \theta_1 \\ \theta_{0,2} - \theta_2 \end{bmatrix} \tag{2.4}$$

For this model, identification requires the rank of $E[z_t x_t']$ to be two. Failure
can occur because either $E[z_t x_t']$ contains a row of zeros or because the first row
is a multiple of the second. Each of these can be interpreted in terms of the
statistical model as follows.


• Case 1: $E[z_t x_t']$ contains a row of zeros

Suppose $E[z_{1,t} x_t'] = (0, 0)$ and $E[z_{2,t} x_t'] = (m_1, m_2)$. In this case
$E[z_{1,t} u_t(\theta)] = 0$ regardless of the value of $\theta_0 - \theta$, and so it provides no
information on $\theta_0$. The other moment condition provides some information but not
enough to uniquely determine $\theta_0$. For example if $m_i \neq 0$ for $i = 1, 2$ then
$E[z_{2,t} u_t(\theta)] = 0$ for any $\theta_0 - \theta$ of the form $(c, -m_1 c/m_2)'$. Identification
fails because an insufficient number of elements of $E[z_t u_t(\theta_0)] = 0$ provide
information about $\theta_0$.


• Case 2: One row of $E[z_t x_t']$ is a multiple of the other

Suppose $E[z_{1,t} x_t'] = kE[z_{2,t} x_t'] = (m_1, m_2)$ for some constant $k$ and for
the sake of argument $m_i \neq 0$ for $i = 1, 2$. In this case $E[z_t u_t(\theta)] = (0, 0)'$
for any $\theta_0 - \theta$ of the form $(c, -m_1 c/m_2)'$ and, once again, $\theta_0$ is not uniquely
determined by the population moment condition. So identification fails
because both elements of $E[z_t u_t(\theta_0)] = 0$ provide exactly the same
information about $\theta_0$.
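Case 2 can be checked numerically. The following is a minimal sketch in which the values of `m1`, `m2`, `k` and `c` are hypothetical numbers chosen purely for illustration; it confirms that the matrix playing the role of $E[z_t x_t']$ has rank one and that every direction of the form $(c, -m_1 c/m_2)'$ sets the population moment to zero.

```python
import numpy as np

# Hypothetical values for Case 2 with p = q = 2 (not taken from any data set):
# the first row of E[z_t x_t'] equals k times the second row.
m1, m2, k = 2.0, 4.0, 3.0
Ezx = np.array([[m1, m2],
                [m1 / k, m2 / k]])

# The identification condition requires rank 2; here the rank is only 1.
rank = np.linalg.matrix_rank(Ezx)

# Any theta_0 - theta of the form (c, -m1*c/m2)' lies in the null space of
# E[z_t x_t'], so by (2.3) the moment condition holds at theta != theta_0.
c = 1.7
delta = np.array([c, -m1 * c / m2])
moment = Ezx @ delta

print("rank of E[z_t x_t']:", rank)
print("E[z_t u_t(theta)] at theta = theta_0 - delta:", moment)
```

Because `c` is arbitrary, there is a whole line of parameter values satisfying the population moment condition, which is exactly the failure of identification described above.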


From this discussion, it is clear that parameter identification and the
relationship between $p$ and $q$ are important. It is therefore useful to introduce the
following terminology. If the identification condition fails then the parameter
vector $\theta_0$ is said to be under-identified (or unidentified) by the population
moment condition. If the parameters are identified and $q = p$ then the parameters
are said to be just-identified by the population moment condition. Notice in this
case there are just $p$ sources of the $p$ pieces of information needed to identify
$\theta_0$. Finally, if the parameters are identified and $q > p$ then $\theta_0$ is said to be
over-identified by the population moment condition. In this case there are more
than $p$ sources of the $p$ pieces of information needed to identify $\theta_0$.

For the remainder of this chapter, it is assumed that the parameters are 
either just- or over-identified. In Section 8.2, we examine the kind of problems 
which can occur if the parameters are under-identified or close to being so, a 
scenario termed “weak identification”. 


2.2 The Estimator and a Fundamental 
Decomposition 


Section 1.2 introduced the generic definition of the GMM estimator. To
specialize this definition to our current context, it is most convenient to work with
matrix notation rather than summations. Therefore we start by introducing the
following definitions. Let $y$ be the $(T \times 1)$ vector whose $t$th element is $y_t$; $X$
be the $(T \times p)$ matrix whose $t$th row is $x_t'$; $Z$ be the $(T \times q)$ matrix whose $t$th
row is $z_t'$; $u$ be the $(T \times 1)$ vector whose $t$th element is $u_t$; and $u(\theta) = y - X\theta$.
Using this notation to make the appropriate substitutions into (1.18), the GMM
minimand for this model is:

$$Q_T(\theta) = \{T^{-1}u(\theta)'Z\}W_T\{T^{-1}Z'u(\theta)\} \tag{2.5}$$

Following Definition 1.2, the GMM estimator of $\theta_0$ is defined as

$$\hat\theta_T = \mathrm{argmin}_{\theta \in \Theta}\, Q_T(\theta) \tag{2.6}$$

where the notation “argmin” is a mathematical shorthand for the value of the
argument, $\theta$, which minimizes the function $Q_T(\theta)$. Since

$$Q_T(\theta) = T^{-2}\{y'ZW_TZ'y + \theta'X'ZW_TZ'X\theta - 2y'ZW_TZ'X\theta\}$$

the first order conditions for the minimization in (2.6) are¹

$$(T^{-1}X'Z)W_T(T^{-1}Z'y) = (T^{-1}X'Z)W_T(T^{-1}Z'X)\hat\theta_T \tag{2.7}$$

So provided $(T^{-1}X'Z)W_T(T^{-1}Z'X)$ is nonsingular, the estimator is given by

$$\hat\theta_T = \{(T^{-1}X'Z)W_T(T^{-1}Z'X)\}^{-1}(T^{-1}X'Z)W_T(T^{-1}Z'y) \tag{2.8}$$
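The closed form in (2.8) translates directly into a few lines of matrix code. The sketch below is illustrative only: the simulated data, the first-stage coefficients `Pi` and the helper names `gmm_linear` and `minimand` are assumptions introduced here, not part of the text. It computes the estimator for $W_T = I_q$ and evaluates the first order conditions at the solution.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p, q = 500, 2, 3

# Simulated data, for illustration only: q = 3 instruments, p = 2 regressors.
Z = rng.normal(size=(T, q))
Pi = np.array([[1.0, 0.5], [0.3, 1.0], [0.7, -0.4]])  # hypothetical first stage
X = Z @ Pi + rng.normal(size=(T, p))
theta0 = np.array([1.0, -2.0])
y = X @ theta0 + rng.normal(size=T)


def gmm_linear(y, X, Z, W):
    """Closed-form linear GMM estimator, equation (2.8)."""
    T = len(y)
    Sxz = X.T @ Z / T            # T^{-1} X'Z, shape (p, q)
    Szy = Z.T @ y / T            # T^{-1} Z'y, shape (q,)
    return np.linalg.solve(Sxz @ W @ Sxz.T, Sxz @ W @ Szy)


def minimand(theta, y, X, Z, W):
    """GMM minimand Q_T(theta), equation (2.5)."""
    g = Z.T @ (y - X @ theta) / len(y)   # T^{-1} Z'u(theta)
    return g @ W @ g


W = np.eye(q)
theta_hat = gmm_linear(y, X, Z, W)

# First order conditions: (T^{-1}X'Z) W_T T^{-1}Z'u(theta_hat), zero up to rounding.
foc = (X.T @ Z / T) @ W @ (Z.T @ (y - X @ theta_hat) / T)
print("theta_hat:", theta_hat)
print("first order conditions:", foc)
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is a numerical convenience; it does not change the estimator.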


It can be recalled from Section 1.2 that GMM (or Minimum Chi-Square)
was introduced to circumvent the problems encountered with Method of
Moments. That earlier discussion emphasized the way in which GMM generalized
the Method of Moments principle. However, the relationship between the two
estimation principles is far more subtle. Although the GMM estimator is defined
via the minimization in (2.6), it is actually the solution to the first order
conditions in (2.7). With a simple rearrangement, these conditions can be rewritten
as

$$(T^{-1}X'Z)W_T T^{-1}Z'u(\hat\theta_T) = 0 \tag{2.9}$$

This characterization of the first order conditions reveals that $\hat\theta_T$ is identical to
the Method of Moments estimator based on

$$E[x_t z_t']W E[z_t u_t(\theta_0)] = 0 \tag{2.10}$$

This Method of Moments interpretation is useful because it makes explicit the
relationship between the estimator and the population moment condition in
Assumption 2.2. Minimization of $Q_T(\theta)$ with respect to $\theta$ amounts to estimation
based on the information that the $p$ linear combinations of $E[z_t u_t(\theta_0)]$ given
in (2.10) are zero. Notice that this interpretation implies that if $q = p$ then
Method of Moments and GMM are equivalent because in this case $E[x_t z_t']W$ is
nonsingular and so (2.10) implies $E[z_t u_t(\theta_0)] = 0$.² In this case, the weighting
matrix plays no role and the GMM estimator is given by³

$$\hat\theta_T = (T^{-1}Z'X)^{-1}(T^{-1}Z'y) \tag{2.11}$$


1 See Dhrymes (1984) [Proposition 95 and Corollary 28, p.110-111].

2 Recall that a similar observation is made in Section 1.2 regarding the equivalence of
Method of Moments and Minimum Chi-Square.

3 Notice that this solution is consistent with (2.8) because if $p = q$ then
$\{(T^{-1}X'Z)W_T(T^{-1}Z'X)\}^{-1} = (T^{-1}Z'X)^{-1}W_T^{-1}(T^{-1}X'Z)^{-1}$, subject to the existence
of the stated inverses.
However if $q > p$ then no such reduction is possible, and the choice of
weighting matrix is important because it determines the exact nature of the linear
combinations of $E[z_t u_t(\theta_0)]$ set to zero in (2.10).
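The irrelevance of the weighting matrix when $q = p$ is easy to confirm numerically. This is a minimal sketch under assumed, simulated data; the helper `gmm_linear` and the matrix `W_arbitrary` are hypothetical names introduced for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T, p = 200, 2

# Simulated just-identified design (q = p = 2), for illustration only.
Z = rng.normal(size=(T, p))
X = Z + 0.5 * rng.normal(size=(T, p))
y = X @ np.array([0.5, 1.5]) + rng.normal(size=T)


def gmm_linear(y, X, Z, W):
    """Closed-form linear GMM estimator, equation (2.8)."""
    T = len(y)
    Sxz = X.T @ Z / T
    Szy = Z.T @ y / T
    return np.linalg.solve(Sxz @ W @ Sxz.T, Sxz @ W @ Szy)


# Equation (2.11): no weighting matrix appears.
theta_iv = np.linalg.solve(Z.T @ X / T, Z.T @ y / T)

# An arbitrary positive definite weighting matrix.
A = rng.normal(size=(p, p))
W_arbitrary = A @ A.T + np.eye(p)
theta_gmm = gmm_linear(y, X, Z, W_arbitrary)

print("IV estimator (2.11):          ", theta_iv)
print("GMM with arbitrary W_T (2.8): ", theta_gmm)
```

With more instruments than parameters the two calculations would no longer coincide in general, which is why the choice of $W_T$ matters in the over-identified case.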

This Method of Moments interpretation also indicates that if $q > p$ then
there is a difference between the information with which we began, Assumption
2.2, and the information actually used in estimation, equation (2.10). To
characterize the relationship between the two, it is useful to develop an alternative
representation for (2.10) which has the same dimension as the population moment
condition. For this part of the analysis, it is more convenient to work with a
nonsingular transformation of the population moment condition, $W^{1/2}E[z_t u_t(\theta_0)]$,
where $W^{1/2}$ satisfies $W = W^{1/2\prime}W^{1/2}$.⁴ So we begin by rewriting (2.10) as

$$F'W^{1/2}E[z_t u_t(\theta_0)] = 0 \tag{2.12}$$

where $F' = E[x_t z_t']W^{1/2\prime}$. Equation (2.12) indicates that GMM estimation is
based on the information that $W^{1/2}E[z_t u_t(\theta_0)]$ lies in the null space of the $(p \times q)$
matrix $F'$. Sowell (1996) observes that this condition is identical to the
restriction that the least squares projection of $W^{1/2}E[z_t u_t(\theta_0)]$ onto the column space of
$F$ is zero. By this logic, we obtain the following alternative representation of
the information used in GMM estimation,

$$F(F'F)^{-1}F'W^{1/2}E[z_t u_t(\theta_0)] = 0 \tag{2.13}$$

While (2.13) consists of $q$ equations in $E[z_t u_t(\theta_0)]$, not all of them are linearly
independent because $\mathrm{rank}\{F(F'F)^{-1}F'\} = \mathrm{rank}\{F\} \leq p$. Notice that we have
already assumed this rank equals $p$ to ensure identification. The re-emergence
of this quantity here provides an alternative perspective on the fundamental
connection between identification and estimation: the $p$ parameters are only
identified if the estimation is based on $p$ linearly independent equations. In
view of this connection, Sowell (1996) refers to the elements of (2.13) as the
identifying restrictions associated with GMM estimation. It follows immediately
from (2.13) that the part of $W^{1/2}E[z_t u_t(\theta_0)]$ unused in estimation is given by

$$(I_q - F(F'F)^{-1}F')W^{1/2}E[z_t u_t(\theta_0)] = 0 \tag{2.14}$$

Equation (2.14) constitutes a set of $\mathrm{rank}\{I_q - F(F'F)^{-1}F'\} = q - p$ linearly
independent equations in $W^{1/2}E[z_t u_t(\theta_0)]$. Hansen (1982) referred to the elements
of (2.14) as the overidentifying restrictions.

This decomposition is fundamental to the analysis of GMM estimators of
overidentified parameter vectors and so it is worth emphasizing its structure.
The $(q \times 1)$ vector of population moment conditions is decomposed into $p$
identifying restrictions and $q - p$ overidentifying restrictions. The identifying
restrictions represent the part of the population moment condition used in
estimation and the overidentifying restrictions are the remainder. Most
importantly, these two components are linearly unrelated because
$F(F'F)^{-1}F'\{I_q - F(F'F)^{-1}F'\} = 0$.
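The structure of the decomposition can be verified with a small numerical example. The sketch below uses randomly generated stand-ins for $E[z_t x_t']$ and $W$ (both hypothetical) and takes $W^{1/2}$ to be the transpose of the Cholesky factor, which is one of several matrices satisfying $W = W^{1/2\prime}W^{1/2}$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 2, 4

# Hypothetical population quantities: E[z_t x_t'] (q x p, rank p) and a
# positive definite weighting matrix W (q x q).
Ezx = rng.normal(size=(q, p))
B = rng.normal(size=(q, q))
W = B @ B.T + q * np.eye(q)

# One valid choice of W^{1/2}: if W = L L' (Cholesky), then W^{1/2} = L'
# satisfies W = W^{1/2}' W^{1/2}.
L = np.linalg.cholesky(W)
W_half = L.T

F = W_half @ Ezx                           # F = W^{1/2} E[z_t x_t']
P = F @ np.linalg.solve(F.T @ F, F.T)      # projection onto the column space of F
M_perp = np.eye(q) - P                     # projection onto its orthogonal complement

# Ranks computed with an explicit tolerance to guard against rounding error.
rank_P = np.linalg.matrix_rank(P, tol=1e-8)
rank_M = np.linalg.matrix_rank(M_perp, tol=1e-8)

print("identifying restrictions:    ", rank_P)   # p
print("overidentifying restrictions:", rank_M)   # q - p
print("max |P (I - P)|:", np.abs(P @ M_perp).max())
```

The two projections split the $q$ transformed moment conditions into a $p$-dimensional part used in estimation and a $(q - p)$-dimensional remainder, with no overlap between them.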


4 There must be a $(q \times q)$ nonsingular matrix $W^{1/2}$ which satisfies this identity because
$W$ is positive definite from Definition 1.2; see Dhrymes (1984) [Corollary 14, p.73].
So far, these components have been defined in terms of population quantities.
We now consider the extent to which this behaviour is mirrored by their sample
counterparts. Since the identifying restrictions represent the information upon
which estimation is based, it would be anticipated that their sample analog holds
at $\hat\theta_T$. This is easily verified to be the case because the first order conditions in
(2.9) imply

$$F_T(F_T'F_T)^{-1}F_T'W_T^{1/2}T^{-1/2}Z'u(\hat\theta_T) = 0 \tag{2.15}$$

where $F_T' = (T^{-1}X'Z)W_T^{1/2\prime}$ and $W_T = W_T^{1/2\prime}W_T^{1/2}$. In contrast, the
overidentifying restrictions are ignored in estimation and so it would be anticipated
that they do not generally hold in the sample. Again, this is the case. However,
they do play a similar remainder role in the sample. From (2.15) it follows that

$$(I_q - F_T(F_T'F_T)^{-1}F_T')W_T^{1/2}T^{-1/2}Z'u(\hat\theta_T) = W_T^{1/2}T^{-1/2}Z'u(\hat\theta_T) \tag{2.16}$$

and so the estimated transformed sample moment is just the sample analog to
the function of the data in the overidentifying restrictions. This leads to a useful
interpretation of the GMM minimand. In Section 1.2, $Q_T(\theta)$ was introduced as
a measure of how far the sample moment is from its expectation of zero. The
substitution of (2.16) into (2.5) indicates that the minimized value, $Q_T(\hat\theta_T)$,
measures how far the sample is from satisfying the overidentifying restrictions.
This interpretation proves useful in the development of statistics for testing
whether the model is correctly specified. However, before we can discuss such
methods, it is necessary to consider the asymptotic properties of the parameter
estimator and the estimated sample moment. So, we delay further discussion of
methods for assessing the model specification until Section 2.5.
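The sample versions of the identifying and overidentifying restrictions can be inspected directly. The following sketch uses simulated data and the weighting matrix $W_T = (T^{-1}Z'Z)^{-1}$, both of which are assumptions made for the illustration; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
T, p, q = 400, 2, 4

# Simulated over-identified design (q - p = 2), for illustration only.
Z = rng.normal(size=(T, q))
X = Z[:, :p] + 0.5 * Z[:, p:] + rng.normal(size=(T, p))
y = X @ np.array([1.0, 0.5]) + rng.normal(size=T)

W = np.linalg.inv(Z.T @ Z / T)             # one valid choice of W_T
W_half = np.linalg.cholesky(W).T           # W_T = W_half' W_half

Sxz = X.T @ Z / T
theta_hat = np.linalg.solve(Sxz @ W @ Sxz.T, Sxz @ W @ (Z.T @ y / T))
u_hat = y - X @ theta_hat

F_T = W_half @ (Z.T @ X / T)               # F_T = W_T^{1/2} (T^{-1} Z'X)
P_T = F_T @ np.linalg.solve(F_T.T @ F_T, F_T.T)
m = W_half @ (Z.T @ u_hat) / np.sqrt(T)    # W_T^{1/2} T^{-1/2} Z'u(theta_hat)

identifying = P_T @ m                      # sample identifying restrictions: zero
overidentifying = (np.eye(q) - P_T) @ m    # sample overidentifying part: equals m

# Minimized GMM objective and its expression in terms of m.
g_hat = Z.T @ u_hat / T
Q_min = g_hat @ W @ g_hat

print("identifying part:    ", identifying)
print("overidentifying part:", overidentifying)
print("Q_T(theta_hat):", Q_min, " m'm / T:", m @ m / T)
```

The last line illustrates the interpretation given above: the minimized objective is the squared length of the overidentifying component, scaled by $T^{-1}$.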


2.3 Asymptotic Properties 


GMM estimation generates two important statistics which play a central role in
inference about the underlying model; these are the parameter estimator and
the estimated sample moment. Since the latter depends on the former, it makes
most sense to begin our discussion of their asymptotic properties with the
parameter estimator, and then to use these results to analyze the behaviour of the
estimated sample moment. The asymptotic analysis of the parameter estimator
focuses on the twin properties of consistency and asymptotic normality. The
latter facilitates the construction of large sample confidence intervals for the
elements of $\theta_0$. As will emerge, these intervals involve a consistent estimator
of the long run variance of the sample moment, and so we briefly consider how
such an estimator can be calculated in our simple model. As mentioned in the
previous section, the estimated sample moment plays an important role in the
construction of hypothesis tests. In this capacity, it is the asymptotic normality
of $T^{-1/2}Z'u(\hat\theta_T)$ which is important, and so it is this aspect of the statistic’s
behaviour upon which we concentrate.

The asymptotic analysis rests on applications of the Weak Law of Large 
Numbers (WLLN) and Central Limit Theorem (CLT) in Lemmas 1.2 and 1.3 
respectively. It was noted in Section 1.4.2 that the assumption of strict stationarity
is insufficient by itself for these theorems and so we must introduce an
additional restriction. Our purpose here is to illustrate the basic ideas and so 
it is convenient to assume away any dependence structure in the data for the 
time being. 


Assumption 2.4 Independence 
The vector $v_t = (x_t', z_t', u_t)'$ is independent of $v_{t+s}$ for all $s \neq 0$.


Together, Assumptions 2.1 and 2.4 imply $v_t$ is an independently and identically
distributed process.

To begin with, it is most convenient to substitute for $y$ in (2.8). Equation
(2.1) implies $y = X\theta_0 + u$ and using this identity in (2.8) yields

$$\hat\theta_T = \theta_0 + \{(T^{-1}X'Z)W_T(T^{-1}Z'X)\}^{-1}(T^{-1}X'Z)W_T(T^{-1}Z'u) \tag{2.17}$$

The consistency and asymptotic normality of $\hat\theta_T$ can be deduced directly from
(2.17). We start with consistency.
From (2.17), it follows that

$$\mathrm{plim}\,\hat\theta_T = \theta_0 + \mathrm{plim}\left[\{(T^{-1}X'Z)W_T(T^{-1}Z'X)\}^{-1}(T^{-1}X'Z)W_T(T^{-1}Z'u)\right] \tag{2.18}$$

Using Slutsky’s Theorem (see Lemma 1.1), (2.18) can be rewritten as

$$\mathrm{plim}\,\hat\theta_T = \theta_0 + \{\mathrm{plim}(T^{-1}X'Z)\,\mathrm{plim}(W_T)\,\mathrm{plim}(T^{-1}Z'X)\}^{-1}\,\mathrm{plim}(T^{-1}X'Z)\,\mathrm{plim}(W_T)\,\mathrm{plim}(T^{-1}Z'u) \tag{2.19}$$

From Definition 1.2, it follows immediately that $\mathrm{plim}(W_T) = W$, a positive
definite symmetric matrix. The limiting behaviour of the other matrices in
(2.19) can be deduced from the WLLN. Since $z_t x_t'$ and $z_t u_t$ are contemporaneous
functions of independent processes, they are themselves independent processes.
Therefore the WLLN yields⁵

$$T^{-1}Z'X = T^{-1}\sum_{t=1}^{T} z_t x_t' \xrightarrow{p} E[z_t x_t'] \tag{2.20}$$

$$T^{-1}Z'u = T^{-1}\sum_{t=1}^{T} z_t u_t \xrightarrow{p} E[z_t u_t] \tag{2.21}$$

It is at this point that the population moment and identification conditions
become important. The identification condition states that $E[z_t x_t']$ is of rank $p$
and so the inverse of $E[x_t z_t']WE[z_t x_t']$ exists. The population moment condition
states that $E[z_t u_t] = 0$. Using these two results in (2.19) yields

$$\mathrm{plim}\,\hat\theta_T = \theta_0 + M E[z_t u_t] = \theta_0 \tag{2.22}$$

where $M = (F'F)^{-1}F'W^{1/2}$ and we have again put $F = W^{1/2}E[z_t x_t']$.
Therefore, $\hat\theta_T$ is consistent for $\theta_0$.
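A small simulation gives a feel for the consistency result. This is only a sketch under an assumed data generating process: the regressor is made endogenous by construction, the two instruments are valid by construction, and the names `simulate` and `gmm_linear` are hypothetical. Least squares is shown alongside purely as a contrast, since its moment condition $E[x_t u_t] = 0$ fails in this design.

```python
import numpy as np

rng = np.random.default_rng(4)
theta0 = 1.0


def simulate(T):
    """Hypothetical design: x_t is endogenous, z_t is a valid instrument vector."""
    z = rng.normal(size=(T, 2))
    v = rng.normal(size=T)
    x = z @ np.array([1.0, 0.5]) + v
    u = 0.8 * v + rng.normal(size=T)   # correlated with x_t, uncorrelated with z_t
    y = theta0 * x + u
    return y, x[:, None], z


def gmm_linear(y, X, Z, W):
    """Closed-form linear GMM estimator, equation (2.8)."""
    T = len(y)
    Sxz = X.T @ Z / T
    Szy = Z.T @ y / T
    return np.linalg.solve(Sxz @ W @ Sxz.T, Sxz @ W @ Szy)


for T in (100, 1_000, 50_000):
    y, X, Z = simulate(T)
    theta_gmm = gmm_linear(y, X, Z, np.eye(2))[0]
    theta_ols = np.linalg.solve(X.T @ X, X.T @ y)[0]
    print(f"T = {T:6d}   GMM: {theta_gmm:.4f}   OLS: {theta_ols:.4f}")
```

As $T$ grows the GMM estimate settles near $\theta_0 = 1$, while the least squares estimate settles near a different value, reflecting the role of the population moment condition in the argument leading to (2.22).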


5 Strictly, it must be assumed that all stated expectations exist. However, since the 
purpose of this chapter is purely expository, we suppress such details here. 




The asymptotic distribution⁶ of the estimator is derived by rewriting (2.17)
as

$$T^{1/2}(\hat\theta_T - \theta_0) = \{(T^{-1}X'Z)W_T(T^{-1}Z'X)\}^{-1}(T^{-1}X'Z)W_T(T^{-1/2}Z'u) \tag{2.23}$$

and analyzing the behaviour of the components on the right hand side of (2.23).
Since $z_t u_t$ is an independent process, the CLT can be invoked to deduce that

$$T^{-1/2}Z'u = T^{-1/2}\sum_{t=1}^{T} z_t u_t \xrightarrow{d} N(0, S) \tag{2.24}$$

where $S = \lim_{T \to \infty} \mathrm{Var}[T^{-1/2}\sum_{t=1}^{T} z_t u_t]$ and the mean of this distribution
follows from the population moment condition. Therefore, $T^{1/2}(\hat\theta_T - \theta_0) = M_T\eta_T$
where $M_T$ converges in probability to the matrix of constants $M$ and
$\eta_T$ converges in distribution to a normal random vector. Using Lemma 1.4, it
follows that

$$T^{1/2}(\hat\theta_T - \theta_0) \xrightarrow{d} N(0, MSM') \tag{2.25}$$

where, as a reminder, $M = \{E[x_t z_t']WE[z_t x_t']\}^{-1}E[x_t z_t']W$. In the case where
$p = q$ then $M$ reduces to $\{E[z_t x_t']\}^{-1}$ and so $MSM' = \{E[z_t x_t']\}^{-1}S\{E[x_t z_t']\}^{-1}$.
Equation (2.25) implies that an approximate large sample $100(1 - \alpha)\%$
confidence interval for $\theta_{0,i}$ is

$$\hat\theta_{T,i} \pm z_{\alpha/2}\sqrt{\hat V_{T,ii}/T} \tag{2.26}$$

where $\hat V_{T,ii}$ is the $(i,i)$ element of a consistent estimator of $MSM'$ and $z_{\alpha/2}$ is the
$100(1 - \alpha/2)$ percentile of the standard normal distribution. A consistent
estimator of $MSM'$ can be obtained from consistent estimators of its components
because by Slutsky’s Theorem if $M_T \xrightarrow{p} M$ and $\hat S_T \xrightarrow{p} S$ then $M_T\hat S_T M_T' \xrightarrow{p} MSM'$.
The obvious choice of $M_T$ is $\{(T^{-1}X'Z)W_T(T^{-1}Z'X)\}^{-1}(T^{-1}X'Z)W_T$ because
it has already been shown this matrix converges in probability to $M$. To
construct $\hat S_T$ it is necessary to be more specific about the form of the long run
covariance matrix, $S$. Under our assumptions $z_t u_t$ is an independently and
identically distributed process with a mean of zero. Together these restrictions
imply

$$E[u_t u_s z_t z_s'] = \begin{cases} E[u^2 zz'], \text{ say} & \text{for } t = s \\ 0 & \text{for } t \neq s \end{cases}$$

and so

$$S = \lim_{T \to \infty} T^{-1}\sum_{t=1}^{T}\sum_{s=1}^{T} E[u_t u_s z_t z_s'] = E[u^2 zz'] \tag{2.27}$$


6 There has been a vast literature on the finite sample properties of IV estimators in the 
linear model. Unfortunately, these results do not generalize to the nonlinear dynamic models 
which are the ultimate focus of this book. Therefore we concentrate on asymptotic results 
here. However, this finite sample theory is briefly reviewed in Section 6.2. 
White (1984, Chapter 6) demonstrates that $S$ can be consistently estimated by

$$\hat S_T = T^{-1}\sum_{t=1}^{T} \hat u_t^2 z_t z_t' \tag{2.28}$$

where $\hat u_t = y_t - x_t'\hat\theta_T$. In certain circumstances, more structure can be placed
on $E[u^2 zz']$ which can be exploited in the construction of $\hat S_T$. For example, in
most econometric textbooks IV is first encountered in the “classical” model in
which $u_t$ possesses the properties:


Assumption 2.5 Classical Assumptions about $u_t$
(i) $E[u_t] = 0$; (ii) $E[u_t^2] = \sigma_0^2$; (iii) $u_t$ and $z_t$ are independent.


Under these assumptions $E[u^2 zz'] = \sigma_0^2 E[z_t z_t']$ and this can be consistently
estimated by

$$\hat S_{CIV} = \hat\sigma_T^2 T^{-1}Z'Z \tag{2.29}$$

where $\hat\sigma_T^2 = T^{-1}u(\hat\theta_T)'u(\hat\theta_T)$ and we have used the “CIV” subscript to emphasize
the imposition of the Classical assumptions about $u_t$ but suppressed the $T$
subscript for notational simplicity.
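The pieces needed for the confidence interval in (2.26) can be assembled as follows. This is a sketch on simulated data with $W_T = (T^{-1}Z'Z)^{-1}$; the design, the first-stage coefficients `Pi` and the variable names are assumptions made for the illustration. The normal percentile is taken from the Python standard library.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(5)
T, p, q = 1_000, 2, 3

# Simulated data, for illustration only.
Z = rng.normal(size=(T, q))
Pi = np.array([[1.0, 0.2], [0.4, 1.0], [0.5, 0.5]])   # hypothetical first stage
X = Z @ Pi + rng.normal(size=(T, p))
y = X @ np.array([1.0, -1.0]) + rng.normal(size=T)

W = np.linalg.inv(Z.T @ Z / T)
Sxz = X.T @ Z / T

# M_T = {(T^{-1}X'Z) W_T (T^{-1}Z'X)}^{-1} (T^{-1}X'Z) W_T, shape (p, q).
M_T = np.linalg.solve(Sxz @ W @ Sxz.T, Sxz @ W)
theta_hat = M_T @ (Z.T @ y / T)            # equation (2.8)
u_hat = y - X @ theta_hat

# Equation (2.28): T^{-1} sum of u_hat_t^2 z_t z_t'.
S_hat = (Z * u_hat[:, None] ** 2).T @ Z / T

# Equation (2.29): the version that imposes the Classical assumptions.
S_civ = (u_hat @ u_hat / T) * (Z.T @ Z / T)

# Estimated asymptotic variance M_T S_hat M_T' and the interval (2.26).
V_hat = M_T @ S_hat @ M_T.T
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
half_width = z_crit * np.sqrt(np.diag(V_hat) / T)
lower, upper = theta_hat - half_width, theta_hat + half_width

for i in range(p):
    print(f"theta_{i + 1}: {theta_hat[i]:.3f}  [{lower[i]:.3f}, {upper[i]:.3f}]")
```

Replacing `S_hat` by `S_civ` in the construction of `V_hat` gives the interval appropriate under Assumption 2.5; in this homoskedastic design the two should be close, though not identical, in any given sample.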

Finally, we derive the asymptotic distribution of the estimated sample
moment. For reasons that will become apparent, it is most convenient to consider
the transformed version of this statistic obtained by premultiplying the original
by $W_T^{1/2}$. First notice that

$$W_T^{1/2}T^{-1/2}Z'u(\hat\theta_T) = W_T^{1/2}T^{-1/2}Z'u - W_T^{1/2}T^{-1}Z'X\,T^{1/2}(\hat\theta_T - \theta_0) \tag{2.30}$$

and so it follows from (2.23) that

$$W_T^{1/2}T^{-1/2}Z'u(\hat\theta_T) = (I_q - P_T)W_T^{1/2}(T^{-1/2}Z'u) \tag{2.31}$$

where $P_T = F_T(F_T'F_T)^{-1}F_T'$ and, as in Section 2.2, $F_T = W_T^{1/2}(T^{-1}Z'X)$.
Inspection of (2.31) reveals that $W_T^{1/2}T^{-1/2}Z'u(\hat\theta_T)$ has a similar structure to
$T^{1/2}(\hat\theta_T - \theta_0)$; that is, it takes the form $N_T\eta_T$ where $N_T$ converges in probability to a
matrix of constants and $\eta_T$ converges to a vector of normal random variables.
Therefore, we can once again use Lemma 1.4 to deduce the limiting distribution,
namely

$$W_T^{1/2}T^{-1/2}Z'u(\hat\theta_T) \xrightarrow{d} N(0, NSN') \tag{2.32}$$

where $N = [I_q - P]W^{1/2}$ and $P = F(F'F)^{-1}F'$. In Section 2.2, it is noted that the
estimated sample moment is closely related to the overidentifying restrictions, and
this connection also manifests itself in the asymptotic distribution. Equation (2.31)
implies that

$$W_T^{1/2}T^{-1/2}Z'u(\hat\theta_T) = (I_q - P)W^{1/2}T^{-1/2}Z'u + o_p(1) \tag{2.33}$$

and so the asymptotic behaviour of $W_T^{1/2}T^{-1/2}Z'u(\hat\theta_T)$ is governed by the
function of the data which appears in the overidentifying restrictions. Once this
relationship is recognized then it becomes apparent that the limiting
distribution in (2.32) only has mean zero if the overidentifying restrictions are satisfied
at $\theta_0$. One other aspect of the limiting distribution should also be noted. The
covariance matrix is

$$NSN' = (I_q - P)W^{1/2}SW^{1/2\prime}(I_q - P) \tag{2.34}$$

Since $W^{1/2}$ and $S$ are nonsingular, it follows⁷ from (2.34) that $\mathrm{rank}(NSN') =
\mathrm{rank}(I_q - P) = q - p$, and hence that $NSN'$ is singular.⁸ Notice that this rank
equals the degree of overidentification and so further emphasizes the connection
between the estimated sample moment and the overidentifying restrictions.


2.4 The Optimal Choice of Weighting Matrix 


So far, the analysis has taken the weighting matrix as given and only placed 
fairly mild restrictions on its composition in Definition 1.2. At the same time, 
it has been seen that this matrix plays a crucial role in the analysis because it 
determines the exact nature of the minimand. In this section, we characterize 
the optimal choice of weighting matrix and this leads us to a discussion of the 
two step or iterated GMM estimator. 

To begin, we must consider what is meant by “optimality” in this context.
An inspection of the previous analysis indicates that the weighting matrix only
affects the asymptotic properties of the estimator via the covariance matrix in
(2.25). This can be anticipated from the role of $W_T$ in the estimation. The
estimator will converge in probability to the true value as long as the population
moment and identification conditions hold. Essentially, these conditions ensure
there is sufficient information from which to estimate $\theta_0$ and that this
information is correct. The choice of weighting matrix determines how this information
is used and so impacts directly upon the precision of the estimation. It is this
feature which is captured by the variance of the asymptotic distribution.
Therefore the optimal weighting matrix is defined to be the value which minimizes
the asymptotic variance.

Inspection of (2.25) reveals that it is the probability limit of $W_T$, $W$, which
affects the asymptotic variance of $\hat\theta_T$. Therefore, we begin by characterizing
the optimal value of $W$ and then consider the issues involved in constructing a
matrix which converges to this limit. For this discussion it is useful to introduce
the following notation for the asymptotic variance of $\hat\theta_T$ given in (2.25),

$$V(W) = \{E[x_t z_t']WE[z_t x_t']\}^{-1}E[x_t z_t']WSWE[z_t x_t']\{E[x_t z_t']WE[z_t x_t']\}^{-1} \tag{2.35}$$

The optimal value of $W$, $W^0$ say, is the value which minimizes $V(W)$ in a
matrix sense and so satisfies

$$V(W) - V(W^0) = \text{a positive semi-definite matrix}$$


7 See Dhrymes (1984) [p.17].
8 See Rao (1973) [Chapter 8] for a discussion of the singular normal distribution.
for any other valid choice of weighting matrix, $W$. Hansen (1982) shows that
$W^0 = S^{-1}$. Substituting this value into (2.35) yields

$$V(S^{-1}) = \{E[x_t z_t']S^{-1}E[z_t x_t']\}^{-1} \tag{2.36}$$

This matrix $V(S^{-1})$ represents an efficiency bound for GMM estimation of $\theta_0$
based on the population moment condition $E[z_t u_t(\theta_0)] = 0$ because all other
choices of $W$ result in a variance which is at least as large.
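The efficiency bound can be illustrated numerically. In the sketch below the matrices `G` and `S` are randomly generated stand-ins for $E[z_t x_t']$ and the long run variance (hypothetical values, not estimates), and `asy_var` is a helper name introduced here. The code checks that $V(S^{-1})$ collapses to (2.36) and that $V(I_q) - V(S^{-1})$ has no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(6)
p, q = 2, 4

# Hypothetical population quantities (not taken from any data set).
G = rng.normal(size=(q, p))              # plays the role of E[z_t x_t']
B = rng.normal(size=(q, q))
S = B @ B.T + np.eye(q)                  # a positive definite long run variance


def asy_var(W):
    """Asymptotic variance V(W) of the GMM estimator for weighting matrix W."""
    bread = np.linalg.inv(G.T @ W @ G)
    return bread @ G.T @ W @ S @ W @ G @ bread


V_identity = asy_var(np.eye(q))
V_optimal = asy_var(np.linalg.inv(S))
bound = np.linalg.inv(G.T @ np.linalg.inv(S) @ G)    # equation (2.36)

# Eigenvalues of V(I_q) - V(S^{-1}); all should be nonnegative up to rounding.
gap_eigenvalues = np.linalg.eigvalsh(V_identity - V_optimal)

print("V(S^{-1}) equals the bound:", np.allclose(V_optimal, bound))
print("eigenvalues of V(I) - V(S^{-1}):", gap_eigenvalues)
```

The same comparison can be repeated with any other positive definite matrix in place of the identity; the eigenvalues of the difference remain nonnegative.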

To construct a GMM estimator which reaches this bound, it suffices to put
$W_T$ equal to $\hat S_T^{-1}$, where $\hat S_T$ is a consistent estimator of $S$. This appears to
create a circularity because (2.28) indicates that $\hat S_T$ depends on $\hat\theta_T$; this is
easily resolved, however. For the consistency of $\hat S_T$, it is only necessary that
this matrix is constructed using a consistent estimator of $\theta_0$ and not the optimal
estimator. This leads us to Hansen’s (1982) two step procedure for optimal
GMM estimation. On the first step, a consistent estimator of $\theta_0$ is obtained
using GMM with a sub-optimal weighting matrix such as $W_T = I_q$ or
$W_T = (T^{-1}Z'Z)^{-1}$. This estimator is used to construct $\hat S_T$. On the second step,
the model is re-estimated using $W_T = \hat S_T^{-1}$. These two steps are sufficient
to obtain an estimator with asymptotic covariance matrix equal to $V(S^{-1})$.
However, the estimator of $S$ used in the second step estimation is based on a
sub-optimal estimator of $\theta_0$ and so there may be gains in finite sample performance
from iterating this procedure. In some cases, iteration may be unnecessary. For
example, in the Classical regression model setting (Assumptions 2.1-2.5) the
optimal estimator can be constructed with $W_T$ just equal to $(T^{-1}Z'Z)^{-1}$
instead of $\hat S_{CIV}^{-1}$ because the factors involving $\hat\sigma_T^2$, and so $\hat\theta_T$, cancel out. In this
case the optimal estimator can be calculated in one step, and can be recognized
as the Two Stage Least Squares (2SLS) estimator. In practice, this type of
convenient cancellation is rare and so iteration is required in most cases of
interest.
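The two step and iterated procedures, together with the 2SLS special case, can be sketched in code as follows. The simulated design (one endogenous regressor, three instruments, heteroskedastic errors), the convergence tolerance and the helper names `gmm_linear` and `s_hat` are all assumptions made for the illustration rather than prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(7)
T, p, q = 500, 1, 3

# Hypothetical design: endogenous regressor, heteroskedastic errors, E[z_t u_t] = 0.
Z = rng.normal(size=(T, q))
v = rng.normal(size=T)
X = (Z @ np.array([1.0, 0.5, 0.25]) + v)[:, None]
u = (0.5 * v + rng.normal(size=T)) * np.sqrt(0.5 + Z[:, 0] ** 2)
y = 2.0 * X[:, 0] + u


def gmm_linear(y, X, Z, W):
    """Closed-form linear GMM estimator, equation (2.8)."""
    T = len(y)
    Sxz = X.T @ Z / T
    Szy = Z.T @ y / T
    return np.linalg.solve(Sxz @ W @ Sxz.T, Sxz @ W @ Szy)


def s_hat(y, X, Z, theta):
    """Estimator (2.28) of S evaluated at the residuals u_t(theta)."""
    resid = y - X @ theta
    return (Z * resid[:, None] ** 2).T @ Z / len(y)


# Step 1: consistent but sub-optimal estimator with W_T = (T^{-1} Z'Z)^{-1}.
W1 = np.linalg.inv(Z.T @ Z / T)
theta_1 = gmm_linear(y, X, Z, W1)

# Step 2: re-estimate with W_T equal to the inverse of S_hat from step 1.
theta_2 = gmm_linear(y, X, Z, np.linalg.inv(s_hat(y, X, Z, theta_1)))

# Optional iteration: update S_hat and theta until the estimate settles.
theta_it = theta_2
for _ in range(50):
    theta_prev = theta_it
    theta_it = gmm_linear(y, X, Z, np.linalg.inv(s_hat(y, X, Z, theta_prev)))
    if np.max(np.abs(theta_it - theta_prev)) < 1e-10:
        break

# Classical case: the scalar sigma_hat^2 in S_CIV cancels, so using the inverse
# of S_CIV gives the same estimate as W1, which is the 2SLS estimator.
resid_1 = y - X @ theta_1
S_civ = (resid_1 @ resid_1 / T) * (Z.T @ Z / T)
theta_civ = gmm_linear(y, X, Z, np.linalg.inv(S_civ))

X_fit = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)       # first-stage fitted values
theta_2sls = np.linalg.solve(X_fit.T @ X, X_fit.T @ y)

print("step 1 / 2SLS:", theta_1, theta_2sls)
print("two step:     ", theta_2)
print("iterated:     ", theta_it)
```

The cancellation shows up as exact numerical agreement between `theta_1`, `theta_civ` and `theta_2sls`, whereas the two step and iterated estimates generally differ from them because `s_hat` does not impose the Classical assumptions.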

Finally, a matter of terminology should be addressed. The estimator
described in this subsection is typically referred to as “the optimal two step (or
iterated) GMM estimator”. It is important to remember that this optimality
only refers to the choice of weighting matrix and there is no implication that the 
population moment condition is optimal in any sense. It is possible to character- 
ize the optimal set of population moment conditions to use in GMM estimation. 
However, this is an extremely complicated problem for the types of model in 
Table 1.1. Therefore, it serves no useful pedagogic value to explore this issue 
here but we return to it in Chapter 7. 


2.5 Specification Error: Consequences and
Detection


So far, it has been assumed that the underlying economic/statistical model is
correctly specified. Unfortunately, this need not be the case, and so it is
important to consider how specification error would impact on the asymptotic
properties of the estimator and the estimated sample moment. Intuition
suggests that such an error renders all inferences suspect at best and completely
invalid at worst. This is borne out by the discussion below, and so motivates
the development of statistical procedures to assess whether the model is
correctly specified. In this section we introduce the overidentifying restrictions test
which has become the standard diagnostic for model specification within the
GMM framework. Other diagnostics are discussed in Chapter 5.

To facilitate the discussion, it is useful to recap briefly what aspects of the
model impact on $\hat\theta_T$ and $T^{-1/2}Z'u(\hat\theta_T)$. To this end, it is useful to introduce the
notation $\mathcal{M}$ to denote the underlying economic/statistical model. As we have
seen, this model has the property

$$\mathcal{M} \Rightarrow E[z_t u_t(\theta_0)] = 0, \ \forall t \text{ for some unique } \theta_0 \in \Theta \tag{2.37}$$

The population moment condition in (2.37) implies the identifying restrictions
are satisfied at $\theta_0$ and so both $\hat\theta_T$ converges in probability to $\theta_0$ and
$T^{1/2}(\hat\theta_T - \theta_0)$ converges to a mean zero normal distribution. The population moment
condition also implies the overidentifying restrictions are satisfied at $\theta_0$ and so
$T^{-1/2}Z'u(\hat\theta_T)$ converges to a mean zero normal distribution.

If $\mathcal{M}$ is no longer considered to be the truth, then there are two natural,
alternative scenarios. First, the true model, $\mathcal{M}_A$ say, although different from
$\mathcal{M}$, shares the property in (2.37); that is

$$\mathcal{M}_A \Rightarrow E[z_t u_t(\theta_+)] = 0, \ \forall t \text{ for some unique } \theta_+ \in \Theta \tag{2.38}$$

Secondly, the true model, $\mathcal{M}_B$ say, implies the property in (2.37) does not hold;
that is

$$\mathcal{M}_B \Rightarrow \nexists\, \theta \in \Theta \text{ such that } E[z_t u_t(\theta)] = 0, \ \forall t \tag{2.39}$$

Notice that (2.38) can hold for any $q \geq p$ but (2.39) can only hold for $q > p$. This
follows because if $q = p$ then $E[z_t u_t(\theta)] = 0$ represents a set of $p$ equations in
$p$ unknowns which must perforce have a solution, subject to the identification
condition in Assumption 2.3. We now consider the behaviour of the estimator
and estimated sample moment under $\mathcal{M}_A$ and $\mathcal{M}_B$.

First, consider the case where the true model is $\mathcal{M}_A$. Since $\mathcal{M}$ and $\mathcal{M}_A$ are
different by definition, they must have different implications for some aspect of
the distribution of $v_t$. However, a comparison of (2.37) and (2.38) indicates that
$\mathcal{M}$ and $\mathcal{M}_A$ have the same implications for $E[z_t u_t(\theta)]$, the only potential
difference being in the parameter value at which the moment condition is satisfied.
The population moment condition in (2.38) implies the identifying restrictions
are satisfied at $\theta_+$, and so the analysis in Section 2.3 can be replicated to show
that $\hat\theta_T$ converges in probability to $\theta_+$. Furthermore, this analysis can be
continued as before to show that $T^{1/2}(\hat\theta_T - \theta_+)$ converges to a mean zero normal
distribution. Equation (2.38) also implies the overidentifying restrictions are
satisfied at $\theta_+$ and so this in turn implies that the estimated sample moment
converges to a mean zero normal random vector. So the only potential difference
between $\mathcal{M}$ and $\mathcal{M}_A$ is in the value to which $\hat\theta_T$ converges. However, as stated
above, neither model implies anything about the value of $\theta$ which satisfies the
population moment condition beyond its uniqueness. Therefore, $\mathcal{M}$ and $\mathcal{M}_A$
are observationally equivalent on the basis of $E[z_t u_t(\theta)]$ alone.

In contrast, $M$ and $M_B$ have very different implications for $E[z_t u_t(\theta)]$.
Equation (2.39) states that there is no value of $\theta$ at which the population
moment condition is satisfied. In spite of this, there must be a solution to the
identifying restrictions because they constitute a set of $p$ equations in $p$
unknowns.⁹ If this solution is denoted $\theta_*$, then it follows by the same logic as
before that $\hat{\theta}_T$ converges in probability to $\theta_*$. It is also possible to develop an
asymptotic distribution theory for the estimator in this case, but the analysis
is more complicated than under $M$. However, the most important difference
emerges in the behaviour of the estimated sample moment. The analysis in
Section 2.3 can be replicated to show that

$$W_T^{1/2} T^{-1/2} Z'u(\hat{\theta}_T) = (I_q - P) W^{1/2} T^{-1/2} Z'u(\theta_*) + O_p(1) \qquad (2.40)$$

It is apparent from (2.40) that the asymptotic behaviour of $W_T^{1/2} T^{-1/2} Z'u(\hat{\theta}_T)$
is determined by whether or not the overidentifying restrictions are satisfied
at $\theta_*$. The answer to this question can be deduced from the properties of $\theta_*$.
By definition, $\theta_*$ satisfies the identifying restrictions and (2.39) implies that
$E[z_t u_t(\theta_*)] \neq 0$. Since

$$W^{1/2} E[z_t u_t(\theta_*)] = P W^{1/2} E[z_t u_t(\theta_*)] + (I_q - P) W^{1/2} E[z_t u_t(\theta_*)]$$

and the first term on the right-hand side is zero by the identifying restrictions,
it must follow that

$$(I_q - P) W^{1/2} E[z_t u_t(\theta_*)] \neq 0 \qquad (2.41)$$

Equations (2.40) and (2.41) imply that $W_T^{1/2} T^{-1/2} Z'u(\hat{\theta}_T)$ is not $O_p(1)$, as it
is under $M$ or $M_A$, but diverges at rate $T^{1/2}$ and, in consequence, does not
converge in distribution.¹⁰

Regardless of whether $M_A$ or $M_B$ is the truth, it is desirable to develop
statistical tests which can indicate that the assumed model is incorrect. Clearly, it
is impossible to discriminate between $M$ and $M_A$ on the basis of $T^{-1/2} Z'u(\hat{\theta}_T)$.
This can only be achieved by deducing a different set of moment conditions from
$M$ and testing whether they are corroborated by the data.¹¹ On the other hand,
$M$ and $M_B$ have different implications for the overidentifying restrictions and
so it would be anticipated that it is possible to discriminate between these two
models based on the estimated sample moment.

Sargan (1958) was the first person to introduce the idea of testing the
overidentifying restrictions in a linear model estimated by IV, and Hansen (1982)
extended the statistic to the GMM framework. It is natural to base the test on
the GMM minimand, $Q_T(\hat{\theta}_T)$, since it is shown in Section 2.3 that this statistic
measures how far the sample is from satisfying the overidentifying restrictions.


9 Again, subject to the identification condition in Assumption 2.3. 

10 See Chapter 4. 

11 However, the same problem recurs because there is always more than one probability 
distribution which can generate a finite set of population moment conditions.

To develop the distribution theory, it is most convenient to focus on the
optimal GMM estimator, and so we set $W_T = \hat{S}_T^{-1}$. Therefore, the overidentifying
restrictions test statistic¹² is given by

$$J_T = T Q_T(\hat{\theta}_T) = T^{-1/2} u(\hat{\theta}_T)'Z \, \hat{S}_T^{-1} \, T^{-1/2} Z'u(\hat{\theta}_T) \qquad (2.42)$$

Under the null hypothesis,

$$H_0 : E[z_t u_t(\theta_0)] = 0$$

$J_T$ converges in distribution to a $\chi^2_{q-p}$.¹³ Notice that the degrees of freedom
equal the number of overidentifying restrictions. Intuition suggests that $J_T$ can
detect when the true model is actually $M_B$, and this is verified in Chapter 5.
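To make the construction of $J_T$ in (2.42) concrete, the following Python sketch simulates a linear model with one regressor and two instruments, so that $q - p = 1$, and computes the statistic after a two step estimation. It is only an illustration under assumptions of our own: the simulated design, the function names, and the choice $\hat{S}_T = T^{-1}\sum_t \hat{u}_t^2 z_t z_t'$ (which is appropriate only if $z_t u_t$ is serially uncorrelated) are not taken from the text.

```python
import random

def inv2(m):
    """Inverse of a 2x2 matrix stored as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad(g, w, h):
    """Bilinear form g' W h for 2-vectors g, h and a 2x2 matrix W."""
    return sum(g[i] * w[i][j] * h[j] for i in range(2) for j in range(2))

def j_statistic(y, x, z):
    """Two step GMM for y_t = theta * x_t + u_t with two instruments z_t.

    Returns the second step estimate and J_T = T * g' S^{-1} g.
    """
    T = len(y)
    gy = [sum(z[t][k] * y[t] for t in range(T)) / T for k in range(2)]
    gx = [sum(z[t][k] * x[t] for t in range(T)) / T for k in range(2)]
    # Step 1: W_T = (Z'Z/T)^{-1}, which gives two stage least squares.
    zz = [[sum(z[t][i] * z[t][j] for t in range(T)) / T for j in range(2)]
          for i in range(2)]
    w1 = inv2(zz)
    theta1 = quad(gx, w1, gy) / quad(gx, w1, gx)
    # Assumed long run variance estimator (no serial correlation in z_t u_t).
    u1 = [y[t] - theta1 * x[t] for t in range(T)]
    s = [[sum(u1[t] ** 2 * z[t][i] * z[t][j] for t in range(T)) / T
          for j in range(2)] for i in range(2)]
    w2 = inv2(s)
    # Step 2: re-estimate with W_T = S^{-1} and evaluate the minimand.
    theta2 = quad(gx, w2, gy) / quad(gx, w2, gx)
    g = [gy[k] - theta2 * gx[k] for k in range(2)]
    return theta2, T * quad(g, w2, g)

# Hypothetical design: both instruments are valid and x_t is endogenous.
rng = random.Random(0)
T, theta0 = 2000, 2.0
z, x, y = [], [], []
for _ in range(T):
    z1, z2, v, e = (rng.gauss(0, 1) for _ in range(4))
    u = 0.5 * v + e
    xt = z1 + z2 + v
    z.append((z1, z2))
    x.append(xt)
    y.append(theta0 * xt + u)

theta_hat, J = j_statistic(y, x, z)
print(theta_hat, J)
```

In this design the null hypothesis holds, so $J$ would be compared with a $\chi^2_1$ critical value (3.84 at the 5% level).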


2.6 Summary 


The purpose of this chapter is to introduce the main elements of the GMM
framework using the example of the IV estimator in the static linear regression
model. This approach is feasible because the intrinsic information in IV
estimation takes the form of a population moment condition. Specifically, IV rests
crucially on the existence of a vector of instruments, $z_t$, that are uncorrelated
with the regression error, $u_t(\theta_0)$, or equivalently that the instruments satisfy
$E[z_t u_t(\theta_0)] = 0$. If this population moment condition is used as a basis for
GMM estimation then the resulting GMM estimator is the IV estimator. The
advantage of deriving IV in this way is that it enables us to highlight seven key
features of the GMM framework:


• Identification: For the estimation to be successful, the population moment
condition must not only be valid but also provide sufficient information
to identify the parameter vector.

• Identifying and overidentifying restrictions: GMM estimation in
overidentified models involves a fundamental decomposition of the moment
condition into identifying restrictions and overidentifying restrictions. The
identifying restrictions contain the information that goes into the
estimation, and the overidentifying restrictions are a remainder that manifests
itself in the estimated sample moment.

• Asymptotic properties: The GMM estimator is consistent and, when
appropriately scaled, has a limiting distribution that is normal.


• Estimated sample moment: The estimated sample moment is shown to
have a limiting normal distribution whose attributes depend directly on
the function of the data in the overidentifying restrictions.

• Long run covariance estimation: To translate this asymptotic normality
into practical inference procedures, it is necessary to estimate the long run
variance of the sample moment consistently.

• Optimal choice of weighting matrix: The optimal choice of weighting
matrix depends on the long run variance of the sample moment and so its
use typically involves a two step or iterated estimation.

• Model diagnostics: The overidentifying restrictions provide a basis for
testing the validity of the model specification via the estimated sample
moment.

12 This is also sometimes referred to as the J-test.

13 It can be recognized that the overidentifying restrictions test is a direct extension of
the Neyman and Pearson (1928) statistic $GF(\hat{\theta}_T)$ discussed in Section 1.2. At first glance, the
degrees of freedom appear to be in conflict; however, there is a logical explanation. Only $k - 1$
of the population moment conditions in (1.7) are free: the $k^{th}$ condition, say, is implied by
the first $k - 1$ plus the constraint $\sum_{i=1}^{k} [\hat{p}_T(i) - h(i; \theta_0)] = 0$, which must hold because of the
definitions of $\hat{p}_T(i)$ and $h(i; \theta_0)$.


Subsequent chapters build from this foundation to present the GMM framework 
in nonlinear dynamic models. Chapter 3 focuses on estimation and, in its course, 
extends the discussion of the first five aspects highlighted above to the general 
setting. The statistical properties derived in Chapter 3 are premised on the 
assumption that the model is correctly specified. Chapter 4 considers the impact 
of misspecification on the limiting properties of the GMM estimator. Chapter 5 
derives the large sample properties of both the overidentifying restrictions test 
and also a number of other hypothesis tests which have been proposed within 
the GMM framework. 


3 


GMM Estimation in 
Correctly Specified Models 


The previous chapter has provided an introduction to the GMM framework and 
the types of inference issues which arise within it. Although many of the details 
reflected the static, linear nature of the model, the underlying intuition did 
not. The essential feature of the estimation is the minimization of a quadratic 
form in the sample analog to a population moment condition which provided 
sufficient information to identify the unknown parameters. In this chapter, we 
show this strategy can be successfully extended to nonlinear dynamic models. 
The focus here is on the estimator and the derivation of its statistical properties 
in correctly specified models. The impact of misspecification on these properties 
is examined in Chapter 4. Matters of inference are postponed until Chapter 5 
when a variety of hypothesis testing procedures are reviewed. The level of the 
discussion is more rigorous than the previous chapter, and the main results are 
formally proved. However, the issues are also illustrated throughout with an 
empirical example to provide guidance on the practical implementation of the 
estimator as well. Here, we focus on Hansen and Singleton’s (1982) consumption 
based asset pricing model which was described in Section 1.3.1. Chapter 9 
reports empirical results for the other four models in Section 1.3. 

Section 3.1 defines the population moment condition and presents condi- 
tions for parameter identification. Section 3.2 discusses the calculation of the 
estimator in practice and includes a brief review of numerical optimization tech- 
niques. Section 3.3 extends the fundamental decomposition of the population 
moment condition into identifying and overidentifying restrictions to the nonlin- 
ear model. Section 3.4 derives the asymptotic properties of the estimator and 
the estimated sample moment. Section 3.4.1 presents a proof of consistency 
and Section 3.4.2 derives the asymptotic distribution of the estimator, and also 
uses this analysis to provide further insights into the form of the identifying 
and overidentifying restrictions. Section 3.4.3 derives the asymptotic
distribution of the estimated sample moment. Section 3.5 describes the construction
of consistent estimators of the long run variance under three scenarios for the
dynamic structure of the sample moment. Section 3.5.1 covers the case where
$f(v_t, \theta_0)$ is a serially uncorrelated process; Section 3.5.2 considers the case where
$f(v_t, \theta_0)$ is generated by a vector autoregressive moving average process; and
finally Section 3.5.3 considers the class of heteroscedasticity and autocorrelation
consistent (HAC) covariance matrix estimators whose properties only require the
dependence structure to satisfy very mild restrictions. Section 3.6 derives the optimal
choice of weighting matrix and this leads to a discussion of the two step and 
iterated GMM estimators. Section 3.7 examines the consequences of various 
transformations and normalizations on the GMM estimator, and this leads to a 
discussion of both the continuous updating GMM estimator and also the con- 
struction of confidence intervals based directly on the GMM minimand. The 
chapter concludes with a slight detour. In Chapter 1, it is stated that many 
estimators can be viewed as special cases of GMM. Although some simple ex- 
amples were provided, it was not possible to elaborate on the point at that 
stage. However, this is possible after the material in the first five sections of 
this chapter. Section 3.8 shows formally how other estimators can be fit within 
the GMM framework. Section 3.9 contains a summary of the chapter. 


3.1 Population Moment Condition and 
Parameter Identification 


In Chapter 1, it was shown that a wide variety of econometric models lead to
population moment conditions which involve nonlinear functions of the data and
parameters. It is therefore desirable to adopt a very general framework which
encompasses all these cases. This means that the analysis in this chapter begins
with the population moment condition and no attempt is made to characterize
the specific data generation process which lies behind it. This population
moment condition involves a function $f(\cdot,\cdot)$ of the observable vector of random
variables $v_t$ and the unknown $(p \times 1)$ parameter vector, $\theta_0$. As before, the
parameter space is denoted by $\Theta \subset \mathbb{R}^p$. However, before we introduce the population
moment and identification conditions, certain restrictions need to be placed on
$v_t$ and $f(\cdot,\cdot)$.


Assumption 3.1 Strict Stationarity
The $(r \times 1)$ random vectors $\{v_t; -\infty < t < \infty\}$ form a strictly stationary process
with sample space $\mathcal{V} \subseteq \mathbb{R}^r$.


Recall that this assumption implies all expectations of functions of $v_t$ are
independent of time.


Assumption 3.2 Regularity Conditions for f(.,.) 

The function $f : \mathcal{V} \times \Theta \to \mathbb{R}^q$, where $q < \infty$, satisfies: (i) it is continuous on
$\Theta$ for each $v_t \in \mathcal{V}$; (ii) $E[f(v_t, \theta)]$ exists and is finite for every $\theta \in \Theta$; (iii)
$E[f(v_t, \theta)]$ is continuous on $\Theta$.

Formally, it is necessary to assume that $f(\cdot, \theta)$ is a measurable function but we
suppress this type of condition throughout the text. All functions considered
are assumed to be measurable; see Newey and McFadden (1994) for a discussion
of circumstances in which this may not hold. Assumption 3.2 holds in most,
if not all, of the models behind the studies listed in Table 1.1. However, this
assumption excludes some cases of interest, such as step functions which are
by their nature discontinuous. One further aspect of Assumption 3.2 should be
noted. The function $f(\cdot)$ is assumed to be finite dimensional. This assumption
is standard and satisfied in all the applications listed in Table 1.1. However,
there are circumstances in which it may be desirable to relax this assumption.
In Section 6.1.3, we consider the limiting behaviour of the estimator when $q$
tends to infinity with the sample size. It is also possible to generalize the
GMM framework to a continuum of moment conditions but we do not pursue
this extension. For the latter, the interested reader is referred to Carrasco and
Florens (2000).

The analysis centers on the following population moment condition.


Assumption 3.3 Population Moment Condition
The random vector $v_t$ and the parameter vector $\theta_0$ satisfy the $(q \times 1)$ population
moment condition: $E[f(v_t, \theta_0)] = 0$.


Just as in the linear model, the population moment condition can only be used
as a basis for estimation if it provides enough information to uniquely identify
the parameter vector $\theta_0$. In the linear model, it is possible to relate parameter
identification to a simple condition which only involves the data. In nonlinear
models, the situation is more complicated. Identification can fail due to the
properties of the data, $v_t$, or due to the properties of $f(\cdot)$ as a function of $\theta$, or
due to an interaction of the two. To characterize how these types of failure can
occur in nonlinear models, it is necessary to introduce the concepts of global and
local identification. The need for this distinction will become apparent below.
The basic condition for parameter identification is given by:


Assumption 3.4 Global Identification
$E[f(v_t, \theta)] \neq 0$ for all $\theta \in \Theta$ such that $\theta \neq \theta_0$.


The adjective “global” emphasizes that the population moment condition
only holds at one value in the entire parameter space. This can be recognized
as the concept of identification used in our discussion of the linear model in the
previous chapter. Within that context, it was possible to derive a convenient
condition for global identification. Unfortunately, this is rarely possible in
nonlinear models. However, there is one type of identification failure in nonlinear
models which can be diagnosed using the condition in Assumption 3.4. This is
the case when failure occurs due to the nature of $f(\cdot)$ as a function of $\theta$. This
type of problem is best understood by considering two examples: in the first
there are just two values of $\theta$ which satisfy the population moment condition;
in the second, there are an infinite number of values which do so.

Example: The Partial Adjustment Model
Suppose the data are generated by the model¹

$$y_t - y_{t-1} = \beta_0 (y^* - y_{t-1}) + u_t$$
$$u_t = \rho_0 u_{t-1} + e_t$$

where $y^*$ represents the desired level of the process $y_t$ and $e_t$ is an i.i.d. process
with mean zero. Simple rearrangement yields

$$y_t = \beta_0 (1 - \rho_0) y^* + (1 + \rho_0 - \beta_0) y_{t-1} + (\beta_0 - 1) \rho_0 y_{t-2} + e_t \qquad (3.1)$$

Now suppose there exists a set of variables $z_t$ which satisfy the population
moment condition $E[z_t e_t(\theta_0)] = 0$ where
$e_t(\theta) = y_t - \beta(1 - \rho) y^* - (1 + \rho - \beta) y_{t-1} - (\beta - 1) \rho y_{t-2}$ and $\theta = (\beta, \rho, y^*)'$.
Although this is very similar to the
population moment condition in Chapter 2, it is outside that framework because
$e_t(\theta)$ is a nonlinear function of $\theta$. Using the condition in Assumption 3.4, the
parameter vector is identified if $E[z_t e_t(\theta)] = 0$ at only $\theta = \theta_0$. To see if this
holds, it is useful to introduce the notation


$$e_t(\mu) = y_t - \mu_0 - \mu_1 y_{t-1} - \mu_2 y_{t-2} \qquad (3.2)$$

where $\mu = (\mu_0, \mu_1, \mu_2)'$. Equation (3.2) can be viewed as a type of “reduced
form” version of $e_t(\theta)$ because any value of $\theta$ implies a value for $\mu$ via the
relationship,

$$\mu_0 = \beta (1 - \rho) y^*$$
$$\mu_1 = 1 + \rho - \beta \qquad (3.3)$$
$$\mu_2 = (\beta - 1) \rho$$

Using these definitions, the condition for identification can be restated as the
requirement that each value of $\mu$ is implied by only one value of $\theta$. However,
inspection of (3.3) reveals this is not the case here. The problems arise because the
bottom two equations imply a quadratic equation for $\rho$, namely $\rho^2 - \rho\mu_1 - \mu_2 = 0$,
to which there are two solutions. Denote these by $\rho_1$ and $\rho_2$. Each of these
solutions implies a value of $\beta$ which satisfies the bottom two equations as well;
denote these by $\beta_i = 1 + \mu_2/\rho_i$ for $i = 1, 2$. Finally let $y_i^* = \mu_0/\{\beta_i (1 - \rho_i)\}$.
Clearly $\theta_i = (\beta_i, \rho_i, y_i^*)'$ yields the same value of $\mu$ for both $i = 1, 2$ and so
Assumption 3.4 is violated. ◇
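The two observationally equivalent parameter values can be computed directly from (3.3). The short Python sketch below maps a value of $\theta = (\beta, \rho, y^*)'$ into $\mu$ and then recovers every $\theta$ consistent with that $\mu$ by solving the quadratic for $\rho$. The function names and the numerical value chosen for $\theta_0$ are hypothetical and serve only as an illustration; the code assumes the roots are real and non-zero.

```python
import math

def reduced_form(beta, rho, y_star):
    """Map theta = (beta, rho, y*) into mu = (mu0, mu1, mu2) using (3.3)."""
    return (beta * (1 - rho) * y_star, 1 + rho - beta, (beta - 1) * rho)

def structural_solutions(mu0, mu1, mu2):
    """Return both values of theta implied by mu (real, non-zero roots assumed)."""
    disc = mu1 ** 2 + 4 * mu2
    roots = [(mu1 + sign * math.sqrt(disc)) / 2 for sign in (1, -1)]
    solutions = []
    for rho in roots:
        beta = 1 + mu2 / rho               # from mu2 = (beta - 1) * rho
        y_star = mu0 / (beta * (1 - rho))  # from mu0 = beta * (1 - rho) * y*
        solutions.append((beta, rho, y_star))
    return solutions

theta0 = (0.5, 0.3, 10.0)                  # hypothetical true value
mu = reduced_form(*theta0)
solutions = structural_solutions(*mu)
for theta in solutions:
    print(theta, reduced_form(*theta))
```

For this choice of $\theta_0$ the second solution has $\beta$ close to 0.7 and $\rho$ close to 0.5: the two roots of the quadratic are $\rho$ and $1 - \beta$, so the alternative solution interchanges their roles while leaving $\mu$ unchanged.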


Example: Eichenbaum’s (1989) Model for Inventory Holdings by 
Firms 


1 This type of model has been used to analyze a wide variety of economic series including 
money demand and inventory holdings. In these applications exogenous regressors are also 
included and formally this removes the identification problem. However, if the regressors only 
play a very marginal role then the same type of identification problems can emerge; see Blinder 
(1986), Hall and Rossana (1991).

It is shown in Section 1.3.4 that Eichenbaum’s (1989) model for inventory
holdings implies that the following population moment condition holds

$$E[\{h_{t+2}(\psi_0) - \rho_0 h_{t+1}(\psi_0)\} z_t] = 0 \qquad (3.4)$$

where $h_t(\psi) = I_t - \{\lambda + (\lambda\beta)^{-1}\} I_{t-1} + \beta^{-1} I_{t-2} + S_t - \phi\beta^{-1} S_{t-1}$ and
$\psi = (\lambda, \beta, \phi)'$. In our earlier discussion of this model, $\phi$ is treated as a parameter
to be estimated rather than the three underlying parameters of which it is a
function. It may have been wondered why $(\delta, \gamma, \alpha)$ are not estimated directly
and the answer is that they are not identified by the population moment
condition. The problem arises because the elements of $(\delta, \gamma, \alpha)$ only appear in a ratio
form via $\phi = 1 - \delta\gamma/\alpha$. Therefore, for any non-zero constant $k$, both $(\delta, \gamma, \alpha)$
and $(k\delta, \gamma, k\alpha)$ yield the same value of $\phi$. This would clearly cause a violation
of Assumption 3.4. However, there is no such problem if only $\phi$ is estimated
instead. ◇


In both these examples, the identification failure arises because of the
nature of $f(\cdot)$ as a function of $\theta$. As mentioned above, identification can fail for
other reasons but these are harder, if not impossible, to diagnose by examining
$E[f(v_t, \theta)]$ directly. In the linear model of the previous chapter, it is possible
to deduce a relatively simple condition for global identification and it would
clearly be desirable to develop something similar for nonlinear models.
Unfortunately, this cannot be done because it is typically impossible to find a useful
alternative representation for $f(v_t, \theta)$ which holds over all $\theta \in \Theta$. However, such
a representation can be found if attention is limited to some suitably defined
neighbourhood of $\theta_0$. The price of this approach is that we are now deriving
conditions for identification only within this neighbourhood and these are referred to
as conditions for local identification. As the names suggest, local identification
does not guarantee global identification but global identification cannot hold
without local identification. Therefore, a more transparent condition for local
identification is useful because it provides insights into when identification can
fail.

To derive the condition for local identification, it is necessary to introduce
the following definition and assumption. An $\epsilon$-neighbourhood of $\theta_0$ is defined
to be the set $N_\epsilon$ which satisfies $N_\epsilon = \{\theta; \|\theta - \theta_0\| < \epsilon\}$. The aforementioned
alternative representation of $f(\cdot)$ is based on a first order Taylor series
approximation for $f(v_t, \theta)$ over a neighbourhood of the form $N_\epsilon$. For this to be valid,
it is necessary that $N_\epsilon \subset \Theta$ and so $\theta_0$ must be an interior point of $\Theta$.² So this
condition is included with certain other regularity conditions in the following
assumption.


Assumption 3.5 Regularity Conditions on $\partial f(v_t, \theta)/\partial\theta'$

(i) The derivative matrix $\partial f(v_t, \theta)/\partial\theta'$ exists and is continuous on $\Theta$ for each
$v_t \in \mathcal{V}$; (ii) $\theta_0$ is an interior point of $\Theta$; (iii) $E[\partial f(v_t, \theta_0)/\partial\theta']$ exists and is
finite.


2 In other words $\theta_0$ must not lie on the boundary of $\Theta$. See Apostol (1974) [p.49] for
the definition of the interior of a set.

Part (i) of this condition is satisfied by most, but not all, of the models behind
the studies listed in Table 1.1. If violations occur they tend to stem from the
presence of absolute values for which the derivative is not defined everywhere
on $\Theta$. For example, the stochastic volatility model in Section 1.3.5 leads to
population moment conditions which involve absolute values.³ It is possible
to develop local identification conditions in these situations but the analysis
becomes more complicated.⁴ Since these cases tend to be the exception rather
than the rule, we work here within the framework of Assumption 3.5. Notice
that the other four models in Section 1.3 satisfy Assumption 3.5(i) and the other
two parts of the assumption can reasonably be expected to hold as well.

The condition for local identification is derived by restricting attention to
sufficiently small $\epsilon$ so that $f(\cdot)$ is equal to the following first order Taylor series
expansion⁵ in $N_\epsilon$

$$f(v_t, \theta) = f(v_t, \theta_0) + \{\partial f(v_t, \theta_0)/\partial\theta'\} (\theta - \theta_0) \qquad (3.5)$$

The advantage of this approach is that (3.5) implies $f(v_t, \theta)$ is a linear function
of $\theta - \theta_0$ in this neighbourhood. Taking expectations on both sides of (3.5) and
using Assumptions 3.3 and 3.5 yields

$$E[f(v_t, \theta)] = \{E[\partial f(v_t, \theta_0)/\partial\theta']\} (\theta - \theta_0) \qquad (3.6)$$

Equation (3.6) has essentially the same structure as (2.3) and so we can appeal
to our earlier analysis of the linear model to deduce the following condition
for local identification.


Assumption 3.6 Local Identification

$$\mathrm{rank}\{E[\partial f(v_t, \theta_0)/\partial\theta']\} = p.$$


This condition can be recognized as the generalization of the identification
condition for the linear model given in Assumption 2.3.⁶ Notice the form of
the condition immediately implies identification fails if there are fewer moment
conditions than parameters, i.e. $q < p$. While this is no surprise given the
discussion in Chapter 2, this restriction was not immediately apparent from
the global identification condition in Assumption 3.4. As in the linear model,
this type of condition can also fail if $q \geq p$. However, one important difference
is that identification in nonlinear models may be sensitive to the value of $\theta_0$
via $\partial f(v_t, \theta)/\partial\theta'$. This opens up the possibility that the population moment
condition may provide enough information to identify the parameters at some
values of $\theta_0$ but not at others.


3 Another example is encountered in Section 9.1 when we consider an extension of the 
mutual fund evaluation method described in Section 1.3.2. 

4 The interested reader is refered to Newey and McFadden (1994) [Section 7]. 

5 See Apostol (1974) [p.361]. 

6 In the linear model, $f(v_t, \theta) = z_t u_t(\theta)$ and so $\partial f(v_t, \theta_0)/\partial\theta' = -z_t x_t'$. The condition
implies global identification in the linear model because (3.5) is then an identity which holds
for all $\theta$ and not just in a neighbourhood of $\theta_0$.

Clearly, the exact nature of the condition in Assumption 3.6 depends on
the $f(\cdot)$ in question. To illustrate the types of condition which can arise in
practice, we now examine local identification in three examples. We begin with
continuations of our earlier examples to illustrate the difference between global
and local identification. We then derive the local identification condition for the
consumption based asset pricing model in Section 1.3.1. Further examples can
be found in Chapter 9.


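Example: Partial Adjustment Model (Continued)
Recall that $f(v_t, \theta) = z_t e_t(\theta)$ and so $\partial f(v_t, \theta_0)/\partial\theta' = z_t \partial e_t(\theta_0)/\partial\theta'$. From the
definition of $e_t(\theta)$ and $\theta$ it follows that

$$E[\partial f(v_t, \theta_0)/\partial\theta'] = E[z_t \tilde{x}_t'] M(\theta_0) \qquad (3.7)$$

where $\tilde{x}_t = (1, y_{t-1}, y_{t-2})'$ and

$$M(\theta) = \begin{bmatrix} -(1 - \rho) y^* & \beta y^* & -\beta (1 - \rho) \\ 1 & -1 & 0 \\ -\rho & 1 - \beta & 0 \end{bmatrix}$$

Given this structure, it follows that⁷

$$\mathrm{rank}\{E[\partial f(v_t, \theta_0)/\partial\theta']\} \leq \min\{\mathrm{rank}(E[z_t \tilde{x}_t']), \mathrm{rank}(M(\theta_0))\}$$

Inspection of $M(\theta_0)$ indicates that in general this matrix is of full rank and so
$\mathrm{rank}\{E[\partial f(v_t, \theta_0)/\partial\theta']\} = \mathrm{rank}(E[z_t \tilde{x}_t'])$.⁸ Therefore local identification rests
on the relationship between the instruments and $\tilde{x}_t$ in a similar way to our
earlier analysis of the linear regression model. Assuming this rank condition
holds, $\theta_0$ is locally identified.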

It is informative to relate this conclusion back to our earlier analysis of this
model. It was shown there that the parameter vector is globally unidentified
because there are two values of $\theta$ which satisfy the population moment condition.
This failure arose because the solutions for $\rho$ satisfy a quadratic equation to
which the roots are

$$\rho = \{\mu_1 \pm \sqrt{\mu_1^2 + 4\mu_2}\}/2$$

Notice that this structure suggests the two solutions are distinct values of $\theta$
and not within an $\epsilon$ neighbourhood of each other for some suitably small value
of $\epsilon$. It is therefore consistent with the finding that the two solutions are locally
identified even though $\theta_0$ is globally unidentified. ◇
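The phrase “in general” can be made more precise with a small calculation. The Python sketch below evaluates the determinant of $M(\theta)$ for the partial adjustment model, taking the rows of $M(\theta)$ to be minus the derivatives of $(\mu_0, \mu_1, \mu_2)$ in (3.3) with respect to $(\beta, \rho, y^*)$; in particular the third row, $(-\rho, 1 - \beta, 0)$, is our own derivation from $\mu_2 = (\beta - 1)\rho$. Expanding along the third column gives $\det M(\theta) = -\beta(1 - \rho)(1 - \beta - \rho)$, so under this reading full rank fails only if $\beta = 0$, $\rho = 1$ or $\beta + \rho = 1$. The last case is exactly the one in which the two roots of the quadratic coincide. The function names and parameter values are hypothetical.

```python
def m_matrix(beta, rho, y_star):
    """M(theta) for the partial adjustment model, theta = (beta, rho, y*)."""
    return [
        [-(1 - rho) * y_star, beta * y_star, -beta * (1 - rho)],
        [1.0, -1.0, 0.0],
        [-rho, 1 - beta, 0.0],
    ]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

det_regular = det3(m_matrix(0.5, 0.3, 10.0))   # beta + rho != 1: full rank
det_singular = det3(m_matrix(0.7, 0.3, 10.0))  # beta + rho == 1: rank deficient
print(det_regular, det_singular)
```

The first determinant equals $-0.5 \times 0.7 \times 0.2 = -0.07$, while the second is zero up to rounding error.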


7 See Dhrymes (1984) [Proposition 7, p.17].
8 See ibid. [Proposition 6, p.16].

Example: Eichenbaum’s (1989) Model for Inventory Holdings by
Firms (Continued)
We again reconsider the problem of estimating the augmented parameter vector
in which $(\delta, \gamma, \alpha)$ are included in $\psi$ instead of $\phi$. To simplify the analysis it is
convenient to set $\rho = 0$ but this does not affect the essence of the argument.
This case maps into our generic notation with $f(v_t, \theta_0) = h_{t+2}(\psi_0) z_t$ where
$\theta = (\lambda, \beta, \delta, \gamma, \alpha)'$. As in the previous example the nonlinearity only arises
through the parameters and so the derivative matrix has a similar structure

$$E[\partial f(v_t, \theta_0)/\partial\theta'] = E[z_t \tilde{x}_t'] M(\theta_0) \qquad (3.8)$$

except this time $\tilde{x}_t = (I_{t+1}, I_t, S_{t+1})'$ and

$$M(\theta) = \begin{bmatrix} (\lambda^2\beta)^{-1} - 1 & (\lambda\beta^2)^{-1} & 0 & 0 & 0 \\ 0 & -\beta^{-2} & 0 & 0 & 0 \\ 0 & (1 - \delta\gamma/\alpha)\beta^{-2} & \gamma\beta^{-1}/\alpha & \delta\beta^{-1}/\alpha & -\delta\gamma\beta^{-1}/\alpha^2 \end{bmatrix}$$

However this time it is immediately apparent that $\mathrm{rank}\{M(\theta)\} \leq 3$ and so
$\mathrm{rank}\{E[\partial f(v_t, \theta_0)/\partial\theta']\} \leq 3 < p$. Therefore $\theta_0$ is locally unidentified in this
model. Again this result ties in with our previous analysis of global
identification. It was shown before that $(\delta, \gamma, \alpha)$ and $(k\delta, \gamma, k\alpha)$ yield the same value of $\phi$
for any nonzero constant $k$. Since $k$ can be arbitrarily close to one, it follows that
if $\theta_0 = (\lambda_0, \beta_0, \delta_0, \gamma_0, \alpha_0)'$ satisfies the population moment condition then there
is always another value $\theta_k = (\lambda_0, \beta_0, k\delta_0, \gamma_0, k\alpha_0)'$ within an $\epsilon$ neighbourhood of
$\theta_0$ which also satisfies the population moment condition for any $\epsilon > 0$.
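The link between the scaling argument and the rank deficiency of $M(\theta)$ can be checked numerically. The Python sketch below is an illustration under our own assumptions: it codes the $(3 \times 5)$ matrix $M(\theta)$ as we read it from the display above, and verifies that the direction in which $\theta_k$ moves as $k$ changes, namely $d = (0, 0, \delta, 0, \alpha)'$, satisfies $M(\theta) d = 0$. The parameter values and function names are hypothetical.

```python
def m_matrix(lam, beta, delta, gamma, alpha):
    """M(theta) for theta = (lambda, beta, delta, gamma, alpha)', a 3x5 matrix."""
    return [
        [1 / (lam ** 2 * beta) - 1, 1 / (lam * beta ** 2), 0.0, 0.0, 0.0],
        [0.0, -1 / beta ** 2, 0.0, 0.0, 0.0],
        [0.0,
         (1 - delta * gamma / alpha) / beta ** 2,
         gamma / (beta * alpha),
         delta / (beta * alpha),
         -delta * gamma / (beta * alpha ** 2)],
    ]

def phi(delta, gamma, alpha):
    """The composite parameter phi = 1 - delta * gamma / alpha."""
    return 1 - delta * gamma / alpha

theta0 = (0.9, 0.95, 0.4, 1.5, 2.0)          # hypothetical parameter values
lam, beta, delta, gamma, alpha = theta0

# Derivative of theta_k = (lambda, beta, k*delta, gamma, k*alpha)' w.r.t. k.
direction = (0.0, 0.0, delta, 0.0, alpha)
m_times_d = [sum(row[j] * direction[j] for j in range(5))
             for row in m_matrix(*theta0)]
print(m_times_d)
print(phi(delta, gamma, alpha), phi(3 * delta, gamma, 3 * alpha))
```

Since $M(\theta) d = 0$ for a non-zero $d$, the columns of $M(\theta)$ are linearly dependent, which is the local counterpart of the fact that $\phi$ is unchanged when $\delta$ and $\alpha$ are scaled by the same constant.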

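Finally, it should be noted that this problem disappears if $\phi$ is treated as
a parameter to be estimated instead of $(\delta, \gamma, \alpha)$. To see this, redefine the
parameter vector to be $\theta = (\lambda, \beta, \phi)'$. In this case, $E[\partial f(v_t, \theta_0)/\partial\theta']$ is given by (3.8)
with

$$M(\theta) = \begin{bmatrix} (\lambda^2\beta)^{-1} - 1 & (\lambda\beta^2)^{-1} & 0 \\ 0 & -\beta^{-2} & 0 \\ 0 & \phi\beta^{-2} & -\beta^{-1} \end{bmatrix}$$

It is immediately apparent that $\mathrm{rank}\{M(\theta)\} = 3$ and so local identification
depends on whether $\mathrm{rank}\{E[z_t \tilde{x}_t']\} = 3$. ◇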


Example: Hansen and Singleton’s (1982) Consumption Based Asset
Pricing Model

It is shown in Section 1.3.1 that if the representative agent possesses a CRRA
utility function then the data and parameter vector, $\theta = (\gamma, \delta)'$, satisfy the
population moment condition in (1.23). For our purposes here, it is convenient
to restrict attention to the case in which there is only one asset with a maturity
of one period. The population moment condition is then $E[z_t u_t(\theta_0)] = 0$ where
$u_t(\theta) = \delta x_{1,t+1}^{\gamma - 1} x_{2,t+1} - 1$, and we have set $x_{1,t+1} = c_{t+1}/c_t$, $x_{2,t+1} = r_{t+1}/p_t$
with the $j$ subscript being dropped as there is only one asset. In this model, we
have

$$E[\partial f(v_t, \theta)/\partial\theta'] = E[\, z_t \delta \log(x_{1,t+1}) x_{1,t+1}^{\gamma - 1} x_{2,t+1}, \ \ z_t x_{1,t+1}^{\gamma - 1} x_{2,t+1} \,] \qquad (3.9)$$

For local identification this matrix must have rank two when evaluated at $\theta_0$.
Apart from requiring $z_t$ to contain at least two elements, it is not easy to deduce
from (3.9) when this rank condition holds. ◇


These three examples illustrate how the rank condition can highlight what
aspects of the model are important for identification. However, as we have
also seen, it may be difficult to determine a priori whether these conditions are
satisfied for the data in hand. In practice, failures in identification may only
become apparent when estimation is attempted and so we return to this topic
in that context in the next section.⁹


3.2 The Estimator and Numerical 
Optimization 


It can be recalled from Definition 1.2 that the GMM minimand takes the form,

$$Q_T(\theta) = \Big\{T^{-1} \sum_{t=1}^{T} f(v_t, \theta)\Big\}' W_T \Big\{T^{-1} \sum_{t=1}^{T} f(v_t, \theta)\Big\} \qquad (3.10)$$


For completeness we restate the properties of the weighting matrix here. 


Assumption 3.7 Properties of the Weighting Matrix 
$W_T$ is a positive semi-definite matrix which converges in probability to the
positive definite matrix of constants $W$.
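As an informal illustration of (3.10), the following Python sketch evaluates the minimand for a user-supplied moment function. The helper names (`sample_moment`, `gmm_minimand`, `euler_moment`) and the tiny artificial data set are hypothetical and chosen only to make the example self-contained. The moment function takes $u_t(\theta) = \delta x_{1,t+1}^{\gamma - 1} x_{2,t+1} - 1$, which is our reading of the consumption based asset pricing example, and the data are constructed so that $u_t = 0$ exactly at $\delta = 0.5$.

```python
def sample_moment(f, data, theta):
    """g_T(theta) = T^{-1} * sum_t f(v_t, theta), returned as a list of length q."""
    T = len(data)
    total = [0.0] * len(f(data[0], theta))
    for v in data:
        for k, fk in enumerate(f(v, theta)):
            total[k] += fk
    return [s / T for s in total]

def gmm_minimand(f, data, theta, W):
    """Q_T(theta) = g_T(theta)' W_T g_T(theta) for a q x q weighting matrix W."""
    g = sample_moment(f, data, theta)
    q = len(g)
    return sum(g[i] * W[i][j] * g[j] for i in range(q) for j in range(q))

def euler_moment(v, theta):
    """f(v_t, theta) = z_t * u_t(theta) with theta = (gamma, delta)."""
    x1, x2, z = v
    gamma, delta = theta
    u = delta * x1 ** (gamma - 1) * x2 - 1
    return [zk * u for zk in z]

# Artificial observations (x1, x2, z): not real consumption or return data.
data = [(1.0, 2.0, (1.0, 0.5)), (1.0, 2.0, (1.0, -0.3)), (1.0, 2.0, (1.0, 0.8))]
W = [[1.0, 0.0], [0.0, 1.0]]                 # identity weighting matrix

q_at_zero = gmm_minimand(euler_moment, data, (0.7, 0.5), W)
q_elsewhere = gmm_minimand(euler_moment, data, (0.7, 0.6), W)
print(q_at_zero, q_elsewhere)
```

Because $W_T$ is positive semi-definite, $Q_T(\theta)$ is never negative, and it equals zero only when the weighted sample moment vanishes, as it does in the first evaluation above.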


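By definition, the GMM estimator of $\theta_0$ is

$$\hat{\theta}_T = \mathrm{argmin}_{\theta \in \Theta} \, Q_T(\theta) \qquad (3.11)$$

where “argmin” stands for the value of the argument, $\theta$, which minimizes the
function, $Q_T(\theta)$. If Assumption 3.5 holds, and in most cases of interest it will,
then the first order conditions for this minimization imply $\partial Q_T(\hat{\theta}_T)/\partial\theta = 0$.
This condition yields¹⁰

$$\Big\{T^{-1} \sum_{t=1}^{T} \partial f(v_t, \hat{\theta}_T)/\partial\theta'\Big\}' W_T \Big\{T^{-1} \sum_{t=1}^{T} f(v_t, \hat{\theta}_T)\Big\} = 0 \qquad (3.12)$$

In the linear model of Chapter 2, these conditions could be solved to obtain a
closed form solution for $\hat{\theta}_T$ as a function of the data. Unfortunately, in nonlinear
models this is typically impossible. For example, the first order conditions for
Hansen and Singleton’s (1982) consumption based asset pricing model are

$$0 = \Big\{T^{-1} \sum_{t=1}^{T} \big[\, z_t \hat{\delta}_T \log(x_{1,t+1}) x_{1,t+1}^{\hat{\gamma}_T - 1} x_{2,t+1}, \ \ z_t x_{1,t+1}^{\hat{\gamma}_T - 1} x_{2,t+1} \,\big]\Big\}' W_T \Big\{T^{-1} \sum_{t=1}^{T} z_t \big(\hat{\delta}_T x_{1,t+1}^{\hat{\gamma}_T - 1} x_{2,t+1} - 1\big)\Big\} \qquad (3.13)$$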

9 Also see Section 3.6. 
10 See Dhrymes (1984) [Proposition 92, p.111].

Only a little trial and error is needed to verify that these cannot be solved to
produce a closed form solution for $\hat{\theta}_T$.

Back in the days of Karl Pearson, the story would have stopped here.
Fortunately, the advance of computer technology over the last forty years has enabled
the development of a vast array of numerical optimization routines which can be
used to calculate $\hat{\theta}_T$. These days, such optimization procedures can be
implemented with just a few lines of code in most econometric or statistical software
packages. In view of this, we do not provide a comprehensive review of these
procedures here.¹¹ Instead we briefly discuss certain issues involved in their
implementation.

These types of computer based routines essentially perform an “informed
version” of trial and error to find the value of $\theta$ which minimizes $Q_T(\theta)$. The
procedure begins with some trial value of $\theta$, $\theta(0)$ say. If this is the value which
minimizes $Q_T(\theta)$ then it should not be possible to find a value of $\theta$ for which
the minimand is smaller. So the computer uses some rule to see if it can find
a value of $\theta$, $\theta(1)$ say, which satisfies $Q_T(\theta(1)) < Q_T(\theta(0))$. If it can, then $\theta(1)$
becomes the new candidate value for $\hat{\theta}_T$ and the computer searches again to see
if it can find a value $\theta(2)$ such that $Q_T(\theta(2)) < Q_T(\theta(1))$. This updating process
continues until it is judged that the value of $\theta$ which minimizes $Q_T(\theta)$ has been
found. It is useful to distinguish three important aspects of such routines.


e The starting value for 0, 6(0). 


e The iterative search method by which the candidate value of Êr is updated 
on the it” step. 


e The convergence criterion used to judge when the minimum has been 
reached. 


The various numerical optimization routines differ in how the iterative search
method is performed. In most problems it is computationally infeasible to perform
a search over the entire parameter space$^{12}$ and so some rule is used to limit
the calculations involved. For example, in a class known as Gradient Methods$^{13}$
the value of $\theta$ is updated on the $i$th step by

$$\theta(i) = \theta(i-1) + \lambda_i D(\theta(i-1))$$

where $\lambda_i$ is a scalar known as the step size and $D(\cdot)$ is a $(p \times 1)$ vector known
as the step direction. The step direction vector is a function of the gradient,
$\partial Q_T(\theta(i-1))/\partial\theta$, and hence reflects the curvature of the function at $\theta(i-1)$.
As the names suggest, $D(\theta(i-1))$ determines the direction in which to update
$\theta(i-1)$ and $\lambda_i$ determines how far to go in that direction.
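
The update rule can be made concrete with a minimal sketch. It assumes the simplest member of the class, in which the step direction is the negative gradient, and it chooses the step size by halving until the minimand falls by a sufficient amount. Both choices, the function name `gradient_method` and the quadratic test minimand are illustrative assumptions and are not taken from any particular package.

```python
import numpy as np

def gradient_method(Q, grad, theta0, max_iter=10_000, tol=1e-10):
    """Iterate theta(i) = theta(i-1) + lam_i * D(theta(i-1))."""
    theta = np.asarray(theta0, dtype=float)
    for i in range(1, max_iter + 1):
        D = -grad(theta)      # step direction: steepest descent (assumed)
        lam = 1.0             # step size: halve until sufficient decrease
        while Q(theta + lam * D) > Q(theta) - 1e-4 * lam * (D @ D) and lam > 1e-16:
            lam *= 0.5
        step = lam * D
        theta = theta + step
        if np.linalg.norm(step) < tol:   # stop when the update is negligible
            break
    return theta, i

# Hypothetical quadratic minimand with its minimum at (1, -2)
Q = lambda th: (th[0] - 1.0) ** 2 + 10.0 * (th[1] + 2.0) ** 2
grad = lambda th: np.array([2.0 * (th[0] - 1.0), 20.0 * (th[1] + 2.0)])

theta_hat, n_iter = gradient_method(Q, grad, [0.5, 0.5])
print(theta_hat, n_iter)
```

Other gradient methods differ mainly in how `D` is built from the gradient, for instance by premultiplying it by an approximation to the inverse Hessian.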


11 Many excellent surveys already exist in the econometrics literature, e.g. Quandt (1983),
Judge, Griffiths, Hill, Lutkepohl, and Lee (1985) [Appendix B] and Gallant (1987) [Chapter 2].

12 Such a strategy is known as a grid search. 

13 For example, see Judge, Griffiths, Hill, Lutkepohl, and Lee (1985) [p.953]. 
Convergence can be assessed in a number of ways. For example, if $\theta(i)$ is
the value which minimizes $Q_T(\theta)$ then the updating routine should not move
away from this point. This suggests that the minimum has been found if

$$\|\theta(i+1) - \theta(i)\| < \epsilon \qquad (3.14)$$


where e is an arbitrarily small positive constant. A typical value for e is 10~° or 
less. This rule allows for the fact that the update $\lambda_{i+1} D(\theta(i))$ is unlikely to be
exactly zero even if $\theta(i)$ is the minimum due to rounding errors in calculation.
As stated in (3.14), the convergence criterion is independent of the magnitude
of $\theta$. In practice, this may be a problem if the latter is very small. Ideally, $\epsilon$
should be replaced by $\eta(\|\theta(i)\| + \tau)$ where $\eta$ and $\tau$ are small positive constants
in the order of 1075 and 107° respectively. However, in some commercially 
available computer packages the rule is of the form in (3.14). If this is the
case then the user must be sensitive to the order of magnitude of $\theta$ when
choosing $\epsilon$. Alternatively, convergence can be assessed by examining the first
order conditions. Once the minimum is reached then (3.12) should be satisfied
and this leads to the criterion

$$\|\partial Q_T(\theta(i))/\partial\theta\| < \epsilon \qquad (3.15)$$

where again allowance is made for rounding errors. Finally, if the minimum has
been reached then the updating should not alter the value of the minimand and
so

$$|Q_T(\theta(i+1)) - Q_T(\theta(i))| < \epsilon \qquad (3.16)$$

Once again, it is desirable for the convergence criterion to reflect the size of the
objective function and so a better version of the rule is obtained by substituting
$\eta(Q_T(\theta(i)) + \tau)$ for $\epsilon$ in (3.16). However, as above, the convergence criterion
in some commercially available packages takes the form in (3.16) and if it does
then the user must be sensitive to the values of the minimand in choosing
$\epsilon$. Which rule should be used? It is often prudent to check all three because
any one can be satisfied by itself without the minimum being reached; see Quandt
(1983) [p.737-8], Gallant (1987) [p.29].
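
The three rules can be checked side by side with a small helper. The sketch below applies the scale-adjusted versions of (3.14) and (3.16) together with (3.15); the function name and the default tolerances `eta` and `tau` are illustrative assumptions rather than values prescribed in the text. The example uses a hypothetical, very flat minimand to show one rule being satisfied on its own while the iterate is still far from the minimum.

```python
import numpy as np

def convergence_checks(theta_new, theta_old, Q_new, Q_old, grad_new,
                       eta=1e-5, tau=1e-3):
    """Evaluate the step, gradient and minimand stopping rules."""
    theta_new = np.asarray(theta_new, dtype=float)
    theta_old = np.asarray(theta_old, dtype=float)
    # (3.14) with epsilon replaced by eta * (||theta(i)|| + tau)
    step_ok = np.linalg.norm(theta_new - theta_old) < eta * (np.linalg.norm(theta_old) + tau)
    # (3.15): norm of the gradient at the new iterate
    grad_ok = np.linalg.norm(grad_new) < eta
    # (3.16) with epsilon replaced by eta * (|Q(theta(i))| + tau)
    value_ok = abs(Q_new - Q_old) < eta * (abs(Q_old) + tau)
    return {"step": bool(step_ok), "gradient": bool(grad_ok), "value": bool(value_ok)}

# Hypothetical flat minimand Q(theta) = 1e-7 * theta^2, minimized at theta = 0
Q = lambda t: 1e-7 * t ** 2
dQ = lambda t: 2e-7 * t
flags = convergence_checks([9.0], [10.0], Q(9.0), Q(10.0), [dQ(9.0)])
print(flags)
```

Here only the gradient rule is satisfied even though the iterate moved a full unit and remains nine units from the minimum, which is the kind of false signal that checking all three rules is meant to catch.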

The choice of starting values is also important. Ideally, $\theta(0)$ should be as
close as possible to the value which minimizes $Q_T(\theta)$ because this reduces the
number of iterations and hence the computational burden. Sometimes a preliminary
estimate of $\theta_0$ is available and this can be used as a starting value.$^{14}$
Whether this is the case or not, it is a wise precaution to run the routine with
more than one set of starting values. In nonlinear models, the minimand may
exhibit a less regular topology than in the linear model with the result that the
numerical routine can have problems finding the minimum. The use of multiple
starting values provides some safeguard against this problem because the routine
can be restarted outside of the problem areas. However, if these problems


14 This would be the case when calculating the two step or iterated GMM estimator; see
Sections 2.4 and 3.6. In other cases, various rules have been suggested for the calculation
of starting values. We do not describe these here but refer the interested reader to Gallant 
(1987) [pp.29-30] and the references therein. 
persist from different starting values then this may indicate the parameter vector
is unidentified by the population moment condition upon which estimation
is based.
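
The value of multiple starting values can be illustrated with a minimal sketch. The one-parameter minimand below is hypothetical and was chosen only because it has two local minima; the fixed-step gradient routine is likewise an assumption made to keep the example short.

```python
def minimize_1d(dQ, theta, step=0.01, tol=1e-10, max_iter=100_000):
    """Fixed-step gradient iteration; stops when the update is negligible."""
    for _ in range(max_iter):
        new = theta - step * dQ(theta)
        if abs(new - theta) < tol:
            return new
        theta = new
    return theta

# Hypothetical minimand with local minima near 0.96 and -1.04
Q = lambda t: (t ** 2 - 1.0) ** 2 + 0.3 * t
dQ = lambda t: 4.0 * t * (t ** 2 - 1.0) + 0.3

starts = [-2.0, -0.5, 0.5, 2.0]
results = []
for s in starts:
    t_hat = minimize_1d(dQ, s)
    results.append((s, t_hat, Q(t_hat)))
    print(s, round(t_hat, 4), round(Q(t_hat), 4))

# Keep the converged point with the smallest minimand
best = min(results, key=lambda r: r[2])
```

Two of the starting values lead to the higher of the two local minima. Comparing the minimand across runs, as in the last line, is what reveals that those runs did not find the global minimum.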

To conclude this section, we provide an illustration of these issues. 


Example: Hansen and Singleton’s (1982) Consumption Based Asset 
Pricing Model 

Hansen and Singleton (1982) estimate their model with various choices of assets. 
We concentrate here on just two of these choices; both are portfolios constructed 
from all the stocks on the New York Stock Exchange and the difference between 
them derives from the weights used in the portfolio. In one, all the assets 
receive an equal weight; this choice is referred to as “equally weighted returns”
and denoted EWR. In the other, the weights on the assets reflect their relative
values; this choice is referred to as “value weighted returns” and denoted VWR.
In principle, the population moment condition in (1.23) holds jointly for both 
choices of assets but it is pedagogically more convenient to estimate the model 
separately for each choice of asset. Each asset has maturity m = 1 and so (1.23) 
implies each of the assets satisfies 


$$E[z_t(\delta_0 x_{1,t+1}^{\gamma_0} x_{2,t+1} - 1)] = 0 \qquad (3.17)$$

where $x_{1,t+1} = c_{t+1}/c_t$, $x_{2,t+1} = r_{t+1}/p_t$ and $z_t$ is the vector of instruments.
To implement the model, it is necessary to specify $z_t$. In Section 1.3.1
it is shown that this moment condition holds for any $z_t \in \Omega_t$, and so the
economic model leaves open a lot of possibilities. Our identification analysis
indicated $\theta_0 = (\gamma_0, \delta_0)'$ is locally identified by (3.17) if the rank of the matrix
in (3.9) is two. As remarked above, this is not particularly illuminating apart
from the requirement that $z_t$ has at least two elements. With so many options
available, Hansen and Singleton (1982) estimate the model with a number of
different choices of instrument. However, here we will focus on just one to
simplify the presentation; this choice is $z_t = (1, x_{1,t}, x_{1,t-1}, x_{2,t}, x_{2,t-1})'$. It is
also necessary to choose a value for the weighting matrix. We use two common
choices, $(T^{-1}\sum_{t=1}^{T} z_t z_t')^{-1}$ and $cI_5$, where $c$ is a constant that is discussed below.
Hansen and Singleton (1982) estimate the model using monthly U.S. data
for the period 1959:2-1978:12, but we take advantage of the march of time to
use an extended sample covering 1959:1-1997:12. Once allowance is made for
the two conditioning observations needed to construct $z_t$, this leaves a sample
of size $T = 465$. The consumption of the representative agent in period $t$, $c_t$, is
defined to be aggregate real consumption of nondurables and services in period
$t$ divided by total population in period $t$. Both consumption and population
series are compiled by the U.S. Department of Commerce, and obtained from
the FRED database constructed by the Federal Reserve Bank of St. Louis. The
consumption figures are seasonally adjusted and expressed in billions of chained
1992 dollars. The nominal return on the assets is obtained from the CRSP
tapes, and transformed into a real return using the implicit deflator associated
with the measure of consumption. Specifically, this gives


$$x_{2,t+1} = (1 + \text{nominal return at time } t+1) \times \frac{\text{deflator at time } t}{\text{deflator at time } t+1}$$


where the deflator at time t is the ratio of aggregate real consumption of non- 
durables and services at time t to its nominal counterpart in period t. The latter 
is also seasonally adjusted and has the same source as the real data. 


The estimations are performed by minimizing $TQ_T(\theta)$ using routines in the
MATLAB version 6.0 Optimization Toolbox (Mathworks, 2000). This package
provides a number of optimization procedures. All our estimations employ the
procedure fminu which is a variant of the gradient method described above.$^{15}$
This estimation routine allows the researcher to specify constants which control
the convergence criterion for the parameters and the minimand. In our estimation
these two numbers are set equal and denoted $\epsilon_M$. To illustrate their impact
on the results, we perform the estimations using $\epsilon_M = 10^{-4}$, $10^{-5}$ and $10^{-6}$.
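
To make the object being minimized concrete, the following sketch evaluates $TQ_T(\theta)$ for the moment condition (3.17) with the five instruments listed above. It is written in Python rather than MATLAB, the function name `hs_minimand` is hypothetical, and the two series are simulated placeholders rather than the consumption and CRSP data used in the text, so the printed values carry no empirical meaning.

```python
import numpy as np

def hs_minimand(theta, x1, x2, W=None):
    """T * Q_T(theta) for moments z_t * (delta * x1_{t+1}**gamma * x2_{t+1} - 1)."""
    gamma, delta = theta
    # Instruments z_t = (1, x1_t, x1_{t-1}, x2_t, x2_{t-1})'
    z = np.column_stack([np.ones(len(x1) - 2),
                         x1[1:-1], x1[:-2], x2[1:-1], x2[:-2]])
    e = delta * x1[2:] ** gamma * x2[2:] - 1.0   # Euler equation residual at t+1
    T = len(e)
    g = z.T @ e / T                              # sample moment, length 5
    if W is None:
        W = np.linalg.inv(z.T @ z / T)           # (T^{-1} sum z_t z_t')^{-1}
    return T * (g @ W @ g)

# Simulated placeholder series (assumed values, not the data in the text)
rng = np.random.default_rng(0)
n = 467                                          # leaves T = 465 after two lags
x1 = np.exp(rng.normal(0.002, 0.005, n))         # stand-in for c_{t+1}/c_t
x2 = np.exp(rng.normal(0.005, 0.04, n))          # stand-in for the gross real return

tq_a = hs_minimand((0.5, 0.5), x1, x2)
tq_b = hs_minimand((0.5, 0.5), x1, x2, W=1e5 * np.eye(5))
print(tq_a, tq_b)
```

Passing this function to a numerical optimizer from several starting values would mimic the exercise reported below, subject to the caveat that the data here are artificial.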


We begin with the estimation of the model for EWR. The results are presented
in Table 3.1. Consider first the results for the case in which $W_T = 10^5 I_5$.
The scaling factor of $10^5$ is included because if $W_T = I_5$ then the value of the
minimand is of the order 10~° for parts of the parameter space and this made it 
difficult for fminu to find the minimum.$^{16}$ Even with this scaling, the minimand
appears ill behaved. When $\epsilon_M = 10^{-4}$, the four starting values do not all initiate
procedures which converge to the same point. This behaviour could arise for
two reasons. First, the minimand may have a well-defined local minimum at
each of the two points to which the algorithm converged. In this case the parameters
are locally identified at each point but obviously not globally identified.
Secondly, the convergence criterion may be insufficiently tight and the iterative
procedure is stopping before it reaches a local minimum. To assess which is
the case here, we re-estimate with $\epsilon_M = 10^{-5}$ and then with $\epsilon_M = 10^{-6}$. As
can be seen, this refinement causes the iterative procedure to converge to the
same point for all the starting values, which points to the second explanation.
This diagnosis is confirmed by a plot of the minimand. Figure 3.1 contains a plot
of the minimand for the case in which $W_T = 10^5 I_5$. As can be seen, $Q_T(\theta)$ is
very flat in the dimension of $\gamma$.


15 See Section 9.1 for an empirical example in which this method does not work well and 
so an alternative routine is employed. 

16 This is an example of the problem noted above. The value of the objective function was 
of a lower order of magnitude than the convergence criteria. 




Table 3.1
First step estimation results for the consumption-based asset
pricing model with equally weighted returns

$W_T = 10^5 I_5$:
Starting values   $\epsilon_M$                    $(\hat{\gamma}, \hat{\delta})$   $TQ_T(\hat{\theta}_T)$
( 0.5,  0.5)      $10^{-4}, 10^{-5}, 10^{-6}$     (-3.145, 0.999)                  5.974
(-0.5, -0.5)      $10^{-4}$                       (-0.334, 0.994)                  6.064
                  $10^{-5}, 10^{-6}$              (-3.145, 0.999)                  5.974
( 5.5,  5.5)      $10^{-4}, 10^{-5}, 10^{-6}$     (-3.145, 0.999)                  5.974
(-5.5, -5.5)      $10^{-4}, 10^{-5}, 10^{-6}$     (-3.145, 0.999)                  5.974

$W_T = (T^{-1}\sum_{t=1}^{T} z_t z_t')^{-1}$:
Starting values   $\epsilon_M$                    $(\hat{\gamma}, \hat{\delta})$   $TQ_T(\hat{\theta}_T)$
( 0.5,  0.5)      $10^{-4}$                       ( 0.500, 0.993)                  0.031
                  $10^{-5}, 10^{-6}$              ( 0.398, 0.993)                  0.031
(-0.5, -0.5)      $10^{-4}, 10^{-5}, 10^{-6}$     ( 0.398, 0.993)                  0.031
( 5.5,  5.5)      $10^{-4}, 10^{-5}, 10^{-6}$     ( 0.398, 0.993)                  0.031
(-5.5, -5.5)      $10^{-4}, 10^{-5}, 10^{-6}$     ( 0.398, 0.993)                  0.031


Figure 3.1: Minimand with $W_T = 10^5 I_5$ for the consumption-based asset
pricing model with equally weighted returns (vertical axis: First Step Minimand)
A similar problem emerges when $W_T = (T^{-1}\sum_{t=1}^{T} z_t z_t')^{-1}$, but again it disappears
when the convergence criterion is tightened. The shape of the minimand
is qualitatively similar to that in Figure 3.1 and so the plot is omitted. Although
we have convergence for each choice of weighting matrix, the parameter
estimates are clearly very sensitive to this choice. In one case the estimated
relative risk aversion of the representative agent $(1 - \hat{\gamma})$ is 0.602 and in the
other it is 4.145. This discrepancy illustrates the motivation for estimation with
the optimal weighting matrix. However, we must delay a presentation of those
results until Section 3.6.


We now consider the estimation of the model with VWR. The results are 
presented in Table 3.2. From Table 3.2, it is clear that the same problems are 
encountered as before with $W_T = 10^5 I_5$. It can be seen from Figure 3.2 that
the minimand has qualitatively the same shape with VWR as it did with EWR. 
Once again, the results are sensitive to the choice of weighting matrix. 


Table 3.2
First step estimation results for the consumption-based asset
pricing model with value weighted returns

$W_T = 10^5 I_5$:
Starting values   $\epsilon_M$                    $(\hat{\gamma}, \hat{\delta})$   $TQ_T(\hat{\theta}_T)$
( 0.5,  0.5)      $10^{-4}$                       ( 0.503, 0.994)                  0.388
                  $10^{-5}, 10^{-6}$              (-1.871, 0.998)                  0.338
(-0.5, -0.5)      $10^{-4}, 10^{-5}$              (-0.348, 0.996)                  0.359
                  $10^{-6}$                       (-1.871, 0.998)                  0.338
( 5.5,  5.5)      $10^{-4}, 10^{-5}, 10^{-6}$     (-1.871, 0.998)                  0.338
(-5.5, -5.5)      $10^{-4}, 10^{-5}, 10^{-6}$     (-1.871, 0.998)                  0.338

$W_T = (T^{-1}\sum_{t=1}^{T} z_t z_t')^{-1}$:
Starting values   $\epsilon_M$                    $(\hat{\gamma}, \hat{\delta})$   $TQ_T(\hat{\theta}_T)$
all *             $10^{-4}, 10^{-5}, 10^{-6}$     ( 0.698, 0.994)                  0.003

Notes: * all = (0.5, 0.5), (-0.5, -0.5), (5.5, 5.5), (-5.5, -5.5)
Figure 3.2: Minimand with $W_T = 10^5 I_5$ for the consumption-based asset
pricing model with value weighted returns


3.3 The Identifying and Overidentifying 
Restrictions 


The definition of the GMM estimator in (3.11) does not require $f(\cdot)$ to be differentiable
with respect to $\theta$. In some cases this generality is useful, but it is
unnecessary in nearly all the models in Table 1.1. When f(.) is differentiable 
then the estimator can be defined equivalently as the solution to the first order 
equations in (3.12). This might appear a minor difference but it is important 
because it facilitates a Method of Moments interpretation for GMM. Just as 
in the linear model, this interpretation leads to a decomposition of the popu- 
lation moment condition into identifying and overidentifying restrictions. As 
shown in Chapter 2, this decomposition can be very useful for understanding 
the properties of GMM and it also plays an important role in the construction 
of diagnostics for the adequacy of the model specification. Similar dividends are 
reaped in the nonlinear model and so now we extend this decomposition to any 
models which satisfy the differentiability conditions of Assumption 3.5.

An inspection of (3.12) reveals that the GMM estimator based on
$E[f(v_t, \theta_0)] = 0$ can be interpreted as a Method of Moments estimator based on

$$F(\theta_0)' W^{1/2} E[f(v_t, \theta_0)] = 0 \qquad (3.18)$$

where $F(\theta_0) = W^{1/2} E[\partial f(v_t, \theta_0)/\partial\theta']$. Equation (3.18) states that
$W^{1/2} E[f(v_t, \theta_0)]$ lies in the null space of $F(\theta_0)'$, and implies $\mathrm{rank}\{F(\theta_0)\}$ linear
combinations of the transformed moment condition are set to zero. Assumption
3.6 guarantees this rank equals $p$ and so, as in the linear model, the Method of
Moments interpretation emphasizes the fundamental connection between identification
and estimation. However, this time there is a slight difference. In the
linear model, the concepts of local and global identification are identical but this
is not the case in nonlinear models as seen in Section 3.1. The form of (3.18) indicates
that it is the local version which is important here. The $p$ parameters are
only locally identified if the estimation is based on $p$ linearly independent equations.
The nature of this connection coincides with our earlier definitions of
the two types of identification. Local identification implies the population moment
condition is satisfied uniquely at $\theta_0$ in a suitably defined neighbourhood.
In this case, (3.18) has a well-defined solution at $\theta_0$. However, there may be
other points in the parameter space at which (3.18) has well-defined solutions;
this eventuality is only ruled out if $\theta_0$ is globally identified.

If $p = q$ then (3.18) is equivalent to $E[f(v_t, \theta_0)] = 0$, and we note parenthetically
that this means the weighting matrix plays no role in the analysis.
However, if $q > p$ then there is a difference between information used in estimation
and the original population moment condition. Since (3.18) has essentially
the same structure as (2.10), we can repeat the same arguments here to
show the population moment condition can be decomposed into identifying and
overidentifying restrictions associated with GMM estimation. The identifying
restrictions are$^{17}$

$$F(\theta_0)[F(\theta_0)'F(\theta_0)]^{-1}F(\theta_0)' W^{1/2} E[f(v_t, \theta_0)] = 0 \qquad (3.19)$$

These restrictions characterize the part of the transformed population moment
condition used in estimation. Formally, (3.19) states that the least squares projection
of $W^{1/2}E[f(v_t, \theta_0)]$ onto the column space of $F(\theta_0)$ is zero, and thereby
places $\mathrm{rank}\{F(\theta_0)[F(\theta_0)'F(\theta_0)]^{-1}F(\theta_0)'\} = p$ restrictions on the transformed
population moment condition. The overidentifying restrictions represent the
remainder and so by definition are$^{18}$

$$\{I_q - F(\theta_0)[F(\theta_0)'F(\theta_0)]^{-1}F(\theta_0)'\} W^{1/2} E[f(v_t, \theta_0)] = 0 \qquad (3.20)$$

Equation (3.20) states that the projection of $W^{1/2}E[f(v_t, \theta_0)]$ onto the orthogonal
complement of $F(\theta_0)$ is zero, and thereby places $q - p$ restrictions on
the transformed population moment condition. Notice that the identifying and
overidentifying matrices have the same projection matrix structure encountered
in the linear model, and so are orthogonal in nonlinear models as well.
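
The projection structure behind (3.19) and (3.20) can be checked numerically. The sketch below uses a randomly drawn $(q \times p)$ matrix as a stand-in for $F(\theta_0)$; the dimensions and the random draw are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
q, p = 5, 2
F = rng.normal(size=(q, p))                # stand-in for F(theta_0), rank p

P = F @ np.linalg.inv(F.T @ F) @ F.T       # matrix in the identifying restrictions
M = np.eye(q) - P                          # matrix in the overidentifying restrictions

rank_P = np.linalg.matrix_rank(P, tol=1e-10)
rank_M = np.linalg.matrix_rank(M, tol=1e-10)
print(rank_P, rank_M)                      # p and q - p restrictions respectively
print(np.abs(P @ M).max())                 # numerically zero: the two are orthogonal
```

The ranks $p$ and $q - p$ add up to $q$, so the two sets of restrictions together exhaust the information in the transformed moment condition.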

The roles of the two sets of restrictions are reflected in their sample counterparts.
Since the identifying restrictions represent the information used in estimation,
their sample analogs are satisfied at $\hat{\theta}_T$ by construction. In contrast, the


17 This terminology is introduced by Sowell (1996) who first characterized the identifying 
restrictions. 

18 This terminology is introduced by Hansen (1982) who first characterized the overidenti- 
fying restrictions in this context. 
overidentifying restrictions are ignored in estimation and so their sample analog
is not satisfied. However, they can be used to give a useful interpretation to the
GMM minimand. From (3.12), it follows that

$$W_T^{1/2} T^{-1/2} \sum_{t=1}^{T} f(v_t, \hat{\theta}_T) = \{I_q - F_T(\hat{\theta}_T)[F_T(\hat{\theta}_T)'F_T(\hat{\theta}_T)]^{-1}F_T(\hat{\theta}_T)'\} \, W_T^{1/2} T^{-1/2} \sum_{t=1}^{T} f(v_t, \hat{\theta}_T) \qquad (3.21)$$

where $F_T(\theta) = W_T^{1/2} T^{-1} \sum_{t=1}^{T} \partial f(v_t, \theta)/\partial\theta'$, and so the transformed estimated
sample moment is the sample analog to the function of the data appearing
in the overidentifying restrictions.$^{19}$ Therefore, $Q_T(\hat{\theta}_T)$ can be interpreted as a
measure of how far the sample is from satisfying the overidentifying restrictions.
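
The sample decomposition can be verified in a case where the estimator has a closed form. The sketch below assumes a linear model with $f(v_t, \theta) = z_t(y_t - x_t'\theta)$, simulated data, and $W_T^{1/2}$ taken as the transposed Cholesky factor of $W_T$; none of these choices comes from the text. The common scale factor in (3.21) is dropped because the identity holds for any scaling of the sample moment.

```python
import numpy as np

rng = np.random.default_rng(1)
T, q, p = 500, 4, 2
Z = rng.normal(size=(T, q))                              # instruments
X = Z[:, :p] + 0.5 * Z[:, p:] @ rng.normal(size=(q - p, p)) + rng.normal(size=(T, p))
y = X @ np.array([1.0, -0.5]) + rng.normal(size=T)

W = np.linalg.inv(Z.T @ Z / T)                           # weighting matrix
ZX, Zy = Z.T @ X / T, Z.T @ y / T
theta_hat = np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)  # linear GMM estimator

g = Zy - ZX @ theta_hat                                  # sample moment at theta_hat
W_half = np.linalg.cholesky(W).T                         # W_half' W_half = W
F = -W_half @ ZX                                         # F_T; derivative of f is -z x'
P = F @ np.linalg.inv(F.T @ F) @ F.T

ident = P @ W_half @ g                                   # identifying part
overid = (np.eye(q) - P) @ W_half @ g                    # overidentifying part
Q_T = g @ W @ g
print(np.abs(ident).max(), Q_T, overid @ overid)
```

The identifying part is zero up to rounding error, and the minimand equals the squared length of the overidentifying part, which is the interpretation given to $Q_T(\hat{\theta}_T)$ above.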


3.4 Asymptotic Properties 


In the linear model, the asymptotic analysis rested crucially on a closed form
expression for $\hat{\theta}_T$. However, as discussed in Section 3.2, such a representation
typically does not exist in nonlinear models and so it is necessary to develop
a different strategy of proof. As it turns out, the difference is most marked in
the proof of consistency. Once consistency is established then it is possible to
invoke the Mean Value Theorem to obtain a representation for $\hat{\theta}_T - \theta_0$ which
facilitates the derivation of asymptotic normality along very similar lines to the
argument used in the linear model. Hansen (1982) establishes these properties
in his original article. Newey and McFadden (1994) and Wooldridge (1994)
provide very useful treatments of the asymptotic analysis of a wide variety of
econometric estimators. Our discussion takes advantage of their results and the
reader is referred to these sources for some of the more technical details.

Before developing the asymptotic analysis it is necessary to place a further
restriction on $v_t$. Recall from Section 1.4.2 that stationarity, by itself, is insufficient
to allow the application of Laws of Large Numbers and the Central Limit
Theorem. Therefore we now impose the following.


Assumption 3.8 Ergodicity
The random process $\{v_t; -\infty < t < \infty\}$ is ergodic.


A formal definition of ergodicity involves rather sophisticated mathematical
ideas and is beyond the scope of this book. Instead we refer the interested
reader to Davidson (1994) [pp.199-203] or Spanos (1999) [pp.424-6]. It is sufficient
for ergodicity that the dependence between $v_t$ and $v_{t+m}$ decreases at a
certain rate to zero as $m \to \infty$. If $v_t$ exhibits this behaviour then it is called a
mixing process. This type of assumption has received a lot of attention in the
econometrics literature because it can be used to underpin asymptotic analysis


19 This assumes $W_T$ is positive definite.
in either stationary or nonstationary environments, and so is more general than
ergodicity which can only be used for stationary series. Further discussion of 
these issues here would constitute a major detour and would distract us from the 
main purpose of this chapter. Therefore, we provide a heuristic introduction to 
mixing processes in Appendix A. This appendix also contains a brief summary 
of the literature on GMM in a nonstationary environment. 


3.4.1 Consistency of the Parameter Estimator 


Even though there is no closed form expression for $\hat{\theta}_T$, it is clearly defined by
(3.11). The key to a proof of consistency is the consideration of what happens
if we perform a similar minimization on the population analog to $Q_T(\theta)$,

$$Q_0(\theta) = \{E[f(v_t, \theta)]\}' W \{E[f(v_t, \theta)]\} \qquad (3.22)$$

The answer follows directly from our earlier assumptions. The population moment
condition implies $Q_0(\theta_0) = 0$. The global identification condition and the
positive definiteness of $W$ imply $Q_0(\theta) > 0$ for all $\theta \neq \theta_0$. Taken together these
two properties imply $Q_0(\theta)$ has a unique minimum at $\theta = \theta_0$. Intuition suggests
that if: (i) $\hat{\theta}_T$ minimizes $Q_T(\theta)$; and (ii) $Q_T(\theta)$ converges in probability to a
function, $Q_0(\theta)$, whose unique minimum is at $\theta_0$; then $\hat{\theta}_T$ must converge in probability
to $\theta_0$. In essence this intuition is correct but there is one mathematical
detail which needs to be taken into account. It is not necessarily the case that
the minimum of a sequence of functions converges to the minimum of the limit of
the sequence of functions. For this to be the case, it is sufficient that $Q_T(\theta)$ converges
uniformly to $Q_0(\theta)$.$^{20}$ This property is not guaranteed by Assumptions
3.1-3.8 and we must impose the following two additional restrictions.


Assumption 3.9 Compactness of $\Theta$
$\Theta$ is a compact set.


This compactness assumption strictly requires the knowledge of bounds on $\theta_0$
which is typically unavailable. However, this is often ignored in practice because
these bounds can be assumed to be sufficiently large not to impact on the
construction of the estimator.$^{21}$ The only other additional assumption is the requirement
that $f(v_t, \theta)$ is bounded by a function with finite expectation for all $\theta$.


Assumption 3.10 Domination of $f(v_t, \theta)$
$E[\sup_{\theta \in \Theta} \|f(v_t, \theta)\|] < \infty$.


With these assumptions imposed, it is possible to deduce uniform convergence.$^{22}$


20 This property is not guaranteed by pointwise convergence of $Q_T(\theta)$. See Apostol
(1974) [Chapter 9] for a useful discussion of the difference between pointwise and uniform 
convergence. 

21 Recall that a compact set is closed and bounded; see Apostol (1974) [Chapter 3]. Newey 
and McFadden (1994) discuss the potential for proving consistency without the imposition of 
compactness. Also see Pötscher and Prucha (1997) [Chapters 3 and 4]. 

22 For example, see Newey and McFadden (1994) [Theorem 2.6], Wooldridge (1994) [The- 
orem 4.1] and the references therein. 
Lemma 3.1 Uniform Convergence in Probability of $Q_T(\theta)$
If Assumptions 3.1, 3.2, 3.7-3.10 hold then $\sup_{\theta \in \Theta} |Q_T(\theta) - Q_0(\theta)| \xrightarrow{p} 0$.
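
A deterministic toy example may help to show why pointwise convergence alone is not enough. In the sketch below, the functions `Q0` and `QT` are hypothetical constructions on the compact set $[0, 1]$ and have nothing to do with a GMM minimand: `QT` converges to `Q0` at every fixed point, yet a narrow dip that slides towards 1 keeps the supremum gap at 2 and drags the minimizer away from the minimizer of `Q0`.

```python
import numpy as np

def Q0(theta):
    """Limit function; its unique minimum on [0, 1] is at theta = 0."""
    return theta

def QT(theta, T):
    """Q0 minus a tent of height 2 centred at 1 - 1/T with half-width 1/T."""
    tent = np.maximum(0.0, 1.0 - T * np.abs(theta - (1.0 - 1.0 / T)))
    return theta - 2.0 * tent

grid = np.linspace(0.0, 1.0, 100_001)
results = []
for T in (10, 100, 1000):
    values = QT(grid, T)
    argmin = float(grid[np.argmin(values)])             # minimizer of Q_T on the grid
    sup_gap = float(np.max(np.abs(values - Q0(grid))))  # sup |Q_T - Q_0| on the grid
    results.append((T, argmin, sup_gap))
    print(T, argmin, sup_gap)
```

The minimizer of `QT` moves towards 1 while the minimizer of `Q0` stays at 0. Uniform convergence rules out exactly this kind of behaviour.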


Once uniform convergence is guaranteed, then consistency can be estab- 
lished. 


Theorem 3.1 Consistency of the Parameter Estimator
If Assumptions 3.1-3.4 and 3.7-3.10 hold then $\hat{\theta}_T \xrightarrow{p} \theta_0$.


For completeness we now provide a more formal proof of this theorem. It is
most convenient to break the proof down into two parts. First, it is shown that
the conditions of the theorem imply:

$$\lim_{T \to \infty} P[0 \le Q_0(\hat{\theta}_T) < \epsilon] = 1 \quad \text{for any } \epsilon > 0 \qquad (3.23)$$

This equation states that $\hat{\theta}_T$ minimizes $Q_0(\theta)$ with probability one as $T \to \infty$.
The second part of the proof shows formally that this property implies
consistency.

Part (i): Proof of (3.23).

This result is deduced from the following three statements about $Q_T(\cdot)$ and
$Q_0(\cdot)$ implied by uniform convergence and the definition of the estimator.

(a): Lemma 3.1 states that the difference between $Q_T(\theta)$ and $Q_0(\theta)$ disappears
with probability one as $T \to \infty$ at any value of $\theta \in \Theta$. Now, by definition
$\hat{\theta}_T \in \Theta$, and so Lemma 3.1 implies $\lim_{T \to \infty} P[|Q_0(\hat{\theta}_T) - Q_T(\hat{\theta}_T)| < \epsilon/3] = 1$
for any constant $\epsilon > 0$.$^{23}$ This implies in turn that

$$\lim_{T \to \infty} P[Q_0(\hat{\theta}_T) < Q_T(\hat{\theta}_T) + \epsilon/3] = 1.$$

(b): Since $\hat{\theta}_T$ minimizes $Q_T(\theta)$ it follows that

$$\lim_{T \to \infty} P[Q_T(\hat{\theta}_T) < Q_T(\theta_0) + \epsilon/3] = 1.$$

(c): By similar reasoning to part (a), it follows that

$$\lim_{T \to \infty} P[Q_T(\theta_0) < Q_0(\theta_0) + \epsilon/3] = 1.$$

A combination of the probability statements in (a) and (b) yields

$$\lim_{T \to \infty} P[Q_0(\hat{\theta}_T) < Q_T(\theta_0) + 2\epsilon/3] = 1$$

and this statement can be combined with (c) to deduce

$$\lim_{T \to \infty} P[Q_0(\hat{\theta}_T) < Q_0(\theta_0) + \epsilon] = 1$$


23 The division of e by three is for notational convenience below and has no substantive 
impact on the argument. 
Equation (3.23) then follows immediately because Assumption 3.3 implies
$Q_0(\theta_0) = 0$ and the positive definiteness of $W$ implies $Q_0(\theta) \ge 0$.

Part (ii): (3.23) $\Rightarrow$ $\hat{\theta}_T \xrightarrow{p} \theta_0$.

Let $N$ be an open subset of $\Theta$ which contains $\theta_0$ and $N^c$ be the complement
of $N$ relative to $\Theta$. By definition $N^c$ is a closed subset of a compact set and so
is itself compact.$^{24}$ Since $N^c$ is compact and $Q_0(\theta)$ is a continuous function it
follows that $Q_0(\theta)$ has an infimum on $N^c$, which we denote by $\inf_{\theta \in N^c} Q_0(\theta)$.
From Assumption 3.4, it follows that this infimum is strictly positive. Therefore
we can substitute $\epsilon = \inf_{\theta \in N^c} Q_0(\theta)$ in (3.23) to deduce

$$\lim_{T \to \infty} P[Q_0(\hat{\theta}_T) < \inf_{\theta \in N^c} Q_0(\theta)] = 1$$

This implies $\lim_{T \to \infty} P[\hat{\theta}_T \notin N^c] = 1$ and hence that $\lim_{T \to \infty} P[\hat{\theta}_T \in N] = 1$.
Finally, since the above argument holds for any choice of $N$ no matter how
“small”, it must follow that $\lim_{T \to \infty} P[\hat{\theta}_T = \theta_0] = 1$ which is the desired result.
□
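
Consistency can be illustrated with a small simulation. The sketch below assumes a hypothetical model in which $x_t$ is i.i.d. $N(\theta_0, 1)$ and $f(x_t, \theta) = (x_t - \theta, \; x_t^2 - \theta^2 - 1)'$, so that $q = 2$ and $p = 1$, with $W_T = I_2$. The minimand is minimized by a grid search over the compact set $[-3, 3]$, which needs no derivatives. The model, the seed and the grid are all assumptions made for illustration.

```python
import numpy as np

def gmm_grid(x, grid):
    """Minimize Q_T(theta) = g_T(theta)' g_T(theta) over a grid of theta values."""
    m1, m2 = x.mean(), (x ** 2).mean()
    Q = (m1 - grid) ** 2 + (m2 - grid ** 2 - 1.0) ** 2
    return float(grid[np.argmin(Q)])

rng = np.random.default_rng(42)
theta0 = 1.5
grid = np.linspace(-3.0, 3.0, 60_001)      # compact parameter space

errors = {}
for T in (50, 500, 5000, 50_000):
    x = rng.normal(theta0, 1.0, size=T)
    errors[T] = abs(gmm_grid(x, grid) - theta0)
    print(T, errors[T])
```

The estimation error in a single simulated sample need not fall monotonically in $T$, but it should be small at the largest sample size, in line with Theorem 3.1.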

Notice that the conditions for Theorem 3.1 placed no restrictions on the
derivative matrix $\partial f(v_t, \theta)/\partial\theta'$. It is true that we have referred to this derivative
matrix in previous sections but its role has not been crucial. It was used to
obtain a condition for local identification in models which satisfied Assumption
3.5; however the concept of global identification did not require its existence.
The derivative matrix also played a role in the discussion of numerical optimization.
However, as mentioned above, $Q_T(\theta)$ can be minimized by search
methods which do not require the calculation of the gradient. As we shall see
in the next sub-section, the derivative matrix plays a more central role in the
proof of asymptotic normality of the estimator.


3.4.2 Asymptotic Normality of the Parameter Estimator 


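To develop the asymptotic distribution of the estimator, we require an asymptotically
valid closed form representation for $T^{1/2}(\hat{\theta}_T - \theta_0)$. This representation
comes from an application of the Mean Value Theorem.$^{25}$ This theorem relates
$f(\cdot)$ to its first derivatives $\partial f(v_t, \theta)/\partial\theta'$ and so it is necessary to impose Assumption
3.5.$^{26}$ To simplify the presentation, define $g_T(\theta) = T^{-1} \sum_{t=1}^{T} f(v_t, \theta)$
and $G_T(\theta) = T^{-1} \sum_{t=1}^{T} \partial f(v_t, \theta)/\partial\theta'$. The Mean Value Theorem implies that

$$g_T(\hat{\theta}_T) = g_T(\theta_0) + G_T(\hat{\theta}_T, \theta_0, \lambda_T)(\hat{\theta}_T - \theta_0) \qquad (3.24)$$

where $G_T(\hat{\theta}_T, \theta_0, \lambda_T)$ is the $(q \times p)$ matrix whose $i$th row is the corresponding
row of $G_T(\bar{\theta}_T^{(i)})$ where $\bar{\theta}_T^{(i)} = \lambda_{T,i}\theta_0 + (1 - \lambda_{T,i})\hat{\theta}_T$ for some $0 \le \lambda_{T,i} \le 1$, and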


24 See Apostol (1974) [pp.50-3]. 

25 See Apostol (1974) [p.355]. 

26 Similar results can be developed for non-differentiable $f(v_t, \theta)$ in cases where $E[f(v_t, \theta)]$
is differentiable; see Newey and McFadden (1994) [Section 7].
$\lambda_T$ is the $(q \times 1)$ vector with $i$th element $\lambda_{T,i}$. Premultiplication of (3.24) by
$G_T(\hat{\theta}_T)'W_T$ yields

$$G_T(\hat{\theta}_T)'W_T g_T(\hat{\theta}_T) = G_T(\hat{\theta}_T)'W_T g_T(\theta_0) + G_T(\hat{\theta}_T)'W_T G_T(\hat{\theta}_T, \theta_0, \lambda_T)(\hat{\theta}_T - \theta_0) \qquad (3.25)$$

Now the first order conditions in (3.12) imply the left hand side of (3.25) is zero
and so with some rearrangement it follows from (3.25) that

$$T^{1/2}(\hat{\theta}_T - \theta_0) = -[G_T(\hat{\theta}_T)'W_T G_T(\hat{\theta}_T, \theta_0, \lambda_T)]^{-1} G_T(\hat{\theta}_T)'W_T T^{1/2} g_T(\theta_0) = -M_T T^{1/2} g_T(\theta_0), \text{ say.} \qquad (3.26)$$

Notice that this equation has the same basic structure as arose in the linear
model at this stage: a random matrix, $-M_T$, times a random vector, $T^{1/2} g_T(\theta_0)$.
Just as in Section 2.3, we start by analyzing the limiting behaviour of these
two components separately and then combine them to deduce the asymptotic
distribution of the estimator. The asymptotic behaviour of $T^{1/2} g_T(\theta_0)$ is given
by a version of the Central Limit Theorem. To apply the Central Limit Theorem,
it is necessary to assume the second moment matrices of the sample moment
satisfy certain restrictions.$^{27}$


Assumption 3.11 Properties of the Variance of the Sample Moment
(i) $E[f(v_t, \theta_0) f(v_t, \theta_0)']$ exists and is finite; (ii) $\lim_{T \to \infty} Var[T^{1/2} g_T(\theta_0)] = S$
exists and is a finite valued positive definite matrix.

The Central Limit Theorem is as follows.


Lemma 3.2 Central Limit Theorem for $T^{1/2} g_T(\theta_0)$
If Assumptions 3.1, 3.3, 3.8 and 3.11 hold then $T^{1/2} g_T(\theta_0) \xrightarrow{d} N(0, S)$.


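The analysis of $M_T$ is more complicated than in the linear model because it
depends on $G_T(\hat{\theta}_T)$ and $G_T(\hat{\theta}_T, \theta_0, \lambda_T)$. Since $\hat{\theta}_T \xrightarrow{p} \theta_0$ and $\bar{\theta}_T^{(i)}$ lies on the line
segment between $\hat{\theta}_T$ and $\theta_0$, then it follows that $\bar{\theta}_T^{(i)} \xrightarrow{p} \theta_0$ for $i = 1, 2, \ldots, q$. Intuition
suggests that this should imply both $G_T(\hat{\theta}_T)$ and $G_T(\hat{\theta}_T, \theta_0, \lambda_T)$ converge
in probability to $G_0 = E[\partial f(v_t, \theta_0)/\partial\theta']$. In essence this is correct, but the
argument can only be formally justified if we impose two further restrictions on
$\partial f(v_t, \theta)/\partial\theta'$.$^{28}$


Assumption 3.12 Continuity of $E[\partial f(v_t, \theta)/\partial\theta']$
$E[\partial f(v_t, \theta)/\partial\theta']$ is continuous on some neighbourhood $N_\epsilon$ of $\theta_0$.


Assumption 3.13 Uniform Convergence of $G_T(\theta)$
$\sup_{\theta \in N_\epsilon} \|G_T(\theta) - E[\partial f(v_t, \theta)/\partial\theta']\| \xrightarrow{p} 0$.$^{29}$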


27 See Hansen (1982) for more primitive conditions for such an S to exist. We do not give 
these conditions here because they are superseded in the next section by the more restrictive 
conditions under which S can be consistently estimated. 

28 See Newey and McFadden (1994) [p.2145]. 

29 For any matrix $A$, we define $\|A\| = [\mathrm{tr}(A'A)]^{1/2}$.
With these assumptions imposed (and, of course, the conditions for the consistency
of $\hat{\theta}_T$) it is possible to deduce the following.


Lemma 3.3 Convergence of $G_T(\hat{\theta}_T)$ and $G_T(\hat{\theta}_T, \theta_0, \lambda_T)$
If Assumptions 3.1-3.5, 3.7-3.10, 3.12 and 3.13 hold then $G_T(\hat{\theta}_T) \xrightarrow{p} G_0$ and
$G_T(\hat{\theta}_T, \theta_0, \lambda_T) \xrightarrow{p} G_0$.


Lemma 3.3 can be combined with Assumption 3.7 and Slutsky’s Theorem
to deduce that $M_T \xrightarrow{p} (G_0'WG_0)^{-1}G_0'W$. Therefore just as in the linear model,
$T^{1/2}(\hat{\theta}_T - \theta_0)$ is asymptotically the product of a random matrix which converges
in probability to a constant, and a random vector which converges to a normal
distribution. Therefore, the desired result follows once again from Lemma 1.4.


Theorem 3.2 Asymptotic Normality of the Parameter Estimator
If Assumptions 3.1-3.5 and 3.7-3.13 hold$^{30}$ then: $T^{1/2}(\hat{\theta}_T - \theta_0) \xrightarrow{d} N(0, MSM')$
where $M = (G_0'WG_0)^{-1}G_0'W$.


Theorem 3.2 implies that an approximate $100(1 - \alpha)\%$ confidence interval for
$\theta_{0,i}$ in large samples is given by

$$\hat{\theta}_{T,i} \pm z_{\alpha/2} \sqrt{\hat{V}_{T,ii}/T} \qquad (3.27)$$

where $\hat{V}_{T,ii}$ is the $i$-$i$th element of a consistent estimator of $MSM'$. As in
the linear model, a natural candidate is based on consistent estimators of the
component matrices $M$ and $S$. Notice that this time the matrix $M_T$ cannot
be used because although consistent, the values of $\{\bar{\theta}_T^{(i)}; i = 1, 2, \ldots, q\}$ are unknown.
However this problem is easily circumvented by replacing $\bar{\theta}_T^{(i)}$ with $\hat{\theta}_T$
and using $\hat{M}_T = [G_T(\hat{\theta}_T)'W_T G_T(\hat{\theta}_T)]^{-1} G_T(\hat{\theta}_T)'W_T$ to estimate $M$. However,
the consistent estimation of $S$ is more complicated and is the topic of Section
3.5.
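
The construction of the interval in (3.27) can be sketched for a simple case. The example assumes a hypothetical model with $x_t$ i.i.d. $N(\theta_0, 1)$, moment function $f(x_t, \theta) = (x_t - \theta, \; x_t^2 - \theta^2 - 1)'$ and $W_T = I_2$. Because the simulated data are serially uncorrelated, $S$ is estimated here by the sample second moment matrix of $f(x_t, \hat{\theta}_T)$. That shortcut is an assumption of this sketch and not a general recommendation, since the estimation of $S$ is the subject of Section 3.5.

```python
import numpy as np

rng = np.random.default_rng(7)
theta0, T = 1.5, 2000
x = rng.normal(theta0, 1.0, size=T)

# First step estimate with W_T = I_2, found by grid search over [-3, 3]
W = np.eye(2)
m1, m2 = x.mean(), (x ** 2).mean()
grid = np.linspace(-3.0, 3.0, 60_001)
Q = (m1 - grid) ** 2 + (m2 - grid ** 2 - 1.0) ** 2
theta_hat = float(grid[np.argmin(Q)])

# G_T(theta_hat): (q x p) derivative matrix of the sample moment
G = np.array([[-1.0], [-2.0 * theta_hat]])
# S estimate, valid here only because the data are serially uncorrelated
f_hat = np.column_stack([x - theta_hat, x ** 2 - theta_hat ** 2 - 1.0])
S_hat = f_hat.T @ f_hat / T

M_hat = np.linalg.solve(G.T @ W @ G, G.T @ W)   # [G'WG]^{-1} G'W
V_hat = M_hat @ S_hat @ M_hat.T                 # estimate of M S M'
se = np.sqrt(np.diag(V_hat) / T)

z = 1.96                                        # normal critical value for a 95% interval
ci = (theta_hat - z * se[0], theta_hat + z * se[0])
print(theta_hat, se[0], ci)
```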

As we have seen, Theorem 3.2 rests on an application of the Mean Value 
Theorem. The latter can only be applied if $\theta_0$ is an interior point of $\Theta$. It
should be noted that if $\theta_0$ is on the boundary then the limiting distribution
theory is different. Since this situation is not common, we do not pursue it 
further here but refer the interested reader to Andrews (2002a). 

To conclude this sub-section, we briefly return to the decomposition of the 
population moment condition into identifying and overidentifying restrictions. 
In Section 3.3, these components are defined and their role explained, but no 
intuition is offered for why they take these particular forms. It is now possi- 
ble to remedy this omission because an intuition can be developed from the 
relationships used to deduce the asymptotic distribution. 

30 Assumption 3.6 is only omitted because it is implied by Assumption 3.5.

The derivation of asymptotic normality began with (3.24). This equation is
formally justified from the Mean Value Theorem and holds for any $\hat\theta_T$. However,
an inspection of the subsequent analysis indicates that we would have
obtained the same asymptotic distribution if instead we had confined attention
to a sufficiently small neighbourhood around $\theta_0$ for which


$$T^{1/2}g_T(\theta) = T^{1/2}g_T(\theta_0) + G_T(\theta_0)T^{1/2}(\theta - \theta_0)$$   (3.28)


In other words, for the purposes of the asymptotic distribution theory it is
sufficient to concentrate on the behaviour of the sample moment in the
neighbourhood of $\theta_0$ for which $T^{1/2}g_T(\theta)$ is a linear function of $T^{1/2}(\theta - \theta_0)$. If we
concentrate on this neighbourhood for the analysis of the minimization of $TQ_T(\theta)$
as well, then the identifying restrictions emerge naturally from the structure of
the problem. Using (3.28), the GMM minimand in this neighbourhood can be
rewritten as


$$TQ_T(\theta) = \|W_T^{1/2}T^{1/2}g_T(\theta)\|^2 = \|W_T^{1/2}T^{1/2}g_T(\theta_0) + F_T(\theta_0)T^{1/2}(\theta - \theta_0)\|^2$$   (3.29)

where, as before, $F_T(\theta) = W_T^{1/2}T^{-1}\sum_{t=1}^T \partial f(v_t,\theta)/\partial\theta'$. Therefore if $\hat\theta_T$
minimizes $Q_T(\theta)$ in this neighbourhood then $T^{1/2}(\hat\theta_T - \theta_0)$ must also be the least
squares solution to


$$W_T^{1/2}T^{1/2}g_T(\theta_0) + F_T(\theta_0)T^{1/2}(\theta - \theta_0) = 0$$   (3.30)


The least squares solution to the inconsistent set of equations in (3.30) is found
by solving the consistent set of equations³¹


$$P_T(\theta_0)W_T^{1/2}T^{1/2}g_T(\theta_0) + F_T(\theta_0)T^{1/2}(\theta - \theta_0) = 0$$   (3.31)


where $P_T(\theta) = F_T(\theta)[F_T(\theta)'F_T(\theta)]^{-1}F_T(\theta)'$. Since the properties of the
projection matrix and (3.28) in turn imply


$$\begin{aligned}
P_T(\theta_0)W_T^{1/2}T^{1/2}g_T(\theta_0) + F_T(\theta_0)T^{1/2}(\theta - \theta_0)
 &= P_T(\theta_0)\{W_T^{1/2}T^{1/2}g_T(\theta_0) + F_T(\theta_0)T^{1/2}(\theta - \theta_0)\} \\
 &= P_T(\theta_0)W_T^{1/2}T^{1/2}g_T(\theta)
\end{aligned}$$


it follows that the least squares solution to (3.30) must also set

$$P_T(\theta_0)W_T^{1/2}T^{1/2}g_T(\theta)$$   (3.32)


to zero. Equations (3.28)-(3.32) show that the identifying restrictions possess
their projection matrix form because, for the purposes of asymptotic distribution
theory, the estimation can be considered as being based on a linearization of
the sample moment condition in the neighbourhood of $\theta_0$. Finally, note that
the least squares solution to (3.30) is


$$T^{1/2}(\hat\theta_T - \theta_0) = -[F_T(\theta_0)'F_T(\theta_0)]^{-1}F_T(\theta_0)'W_T^{1/2}T^{1/2}g_T(\theta_0)$$   (3.33)


Equation (3.33) is easily verified to be asymptotically equivalent to the formula
in (3.26) from which we deduced the asymptotic normality of the estimator.


31 For example, see Strang (1988) [p.156]. 
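
The algebra in (3.30)-(3.33) can be checked numerically. The sketch below uses randomly generated stand-ins for $F_T(\theta_0)$ and $W_T^{1/2}T^{1/2}g_T(\theta_0)$ (the names `F` and `y` and the dimensions are assumptions made for illustration) and confirms that the least squares solution solves the projected system (3.31) even though the original system (3.30) is inconsistent.

```python
import numpy as np

rng = np.random.default_rng(0)
q, p = 4, 2
F = rng.normal(size=(q, p))   # stand-in for F_T(theta_0), full column rank
y = rng.normal(size=q)        # stand-in for W_T^{1/2} T^{1/2} g_T(theta_0)

# (3.33): least squares solution x = T^{1/2}(theta_hat - theta_0)
x = -np.linalg.solve(F.T @ F, F.T @ y)

# Projection matrix P = F (F'F)^{-1} F'
P = F @ np.linalg.solve(F.T @ F, F.T)

# Left-hand side of (3.30) at the solution; generally non-zero when q > p
residual = y + F @ x

# Left-hand side of (3.31) at the solution; zero up to rounding error
projected = P @ y + F @ x
```

Because $P F = F$, the vector `P @ residual` coincides with `projected`, which is the numerical counterpart of the statement that the identifying restrictions in (3.32) are set to zero.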
3.4.3 Asymptotic Normality of the Estimated Sample
Moment 


It is shown in Section 3.3 that the estimated sample moment represents a source
of information about whether the overidentifying restrictions are satisfied in
the population. This property is exploited elsewhere to develop a test of the
hypothesis that the model is correctly specified.³² At this stage, we confine our
attention to deriving the asymptotic distribution of $W_T^{1/2}T^{1/2}g_T(\hat\theta_T)$ in correctly
specified models.

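Equation (3.24) implies


$$W_T^{1/2}T^{1/2}g_T(\hat\theta_T) = W_T^{1/2}T^{1/2}g_T(\theta_0) + W_T^{1/2}G_T(\hat\theta_T,\theta_0,\lambda_T)T^{1/2}(\hat\theta_T - \theta_0)$$   (3.34)

If we substitute for $T^{1/2}(\hat\theta_T - \theta_0)$ from (3.26) then (3.34) can be written as


$$W_T^{1/2}T^{1/2}g_T(\hat\theta_T) = N_T(\hat\theta_T)W_T^{1/2}T^{1/2}g_T(\theta_0)$$   (3.35)

where

$$N_T(\hat\theta_T) = I_q - W_T^{1/2}G_T(\hat\theta_T,\theta_0,\lambda_T)[G_T(\hat\theta_T)'W_TG_T(\hat\theta_T,\theta_0,\lambda_T)]^{-1}G_T(\hat\theta_T)'W_T^{1/2\prime}$$


Equation (3.35) implies $W_T^{1/2}T^{1/2}g_T(\hat\theta_T)$ has the same generic structure as the
expression for $T^{1/2}(\hat\theta_T - \theta_0)$ in (3.26) namely: a random matrix times the random
vector, $T^{1/2}g_T(\theta_0)$. Therefore we can use the same arguments as Section 3.4.2
to deduce the following result.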


Theorem 3.3 Asymptotic Normality of the Estimated Sample 
Moment 

If Assumptions 3.1-3.5 and 3.7-3.13 hold then: $W_T^{1/2}T^{1/2}g_T(\hat\theta_T) \xrightarrow{d} N(0, NW^{1/2}SW^{1/2\prime}N')$
where $N = [I_q - P(\theta_0)]$ and $P(\theta_0) = F(\theta_0)[F(\theta_0)'F(\theta_0)]^{-1}F(\theta_0)'$.


The connection between the estimated sample moment and the overidentifying
restrictions manifests itself in the asymptotic distribution. Equation (3.35)
implies that


$$W_T^{1/2}T^{1/2}g_T(\hat\theta_T) = [I_q - P(\theta_0)]W^{1/2}T^{1/2}g_T(\theta_0) + o_p(1)$$   (3.36)


Inspection of (3.36) reveals that the asymptotic behaviour of the estimated
sample moment is governed by the function of the data which appears in the
overidentifying restrictions. Therefore, the mean of the asymptotic distribution
in Theorem 3.3 is zero because the overidentifying restrictions are satisfied at
$\theta_0$. This relationship also has an impact on the properties of the variance of
the limiting distribution. Since $W^{1/2}$ and $S$ are nonsingular, it follows that³³
$rank\{NW^{1/2}SW^{1/2\prime}N'\} = rank\{I_q - P(\theta_0)\} = q - p$, and so the covariance matrix is
singular.³⁴ This rank is easily recognized to be the number of overidentifying
restrictions.


32 See Section 2.5 and Chapter 5. 
33 See Dhrymes (1984) [p.17]. 
34 See Rao (1973) [Chapter 8] for a discussion of the singular normal distribution. 
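
The rank result can be illustrated with a small numerical experiment. In the sketch below, `F` is a randomly generated stand-in for $F(\theta_0)$ and `V` is a hypothetical positive definite matrix playing the role of $W^{1/2}SW^{1/2\prime}$; neither comes from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
q, p = 5, 2
F = rng.normal(size=(q, p))                 # stand-in for F(theta_0)
P = F @ np.linalg.solve(F.T @ F, F.T)       # projection onto the column space of F
N = np.eye(q) - P                           # N = I_q - P(theta_0)

A = rng.normal(size=(q, q))
V = A @ A.T + np.eye(q)                     # hypothetical nonsingular W^{1/2} S W^{1/2}'

cov = N @ V @ N.T                           # covariance matrix in Theorem 3.3
rank_cov = np.linalg.matrix_rank(cov)       # number of overidentifying restrictions
```

Here `rank_cov` equals `q - p`, and `N @ F` is numerically zero, which reflects the fact that the estimated sample moment carries no information in the directions used to identify the parameters.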
3.5 Long Run Covariance Matrix Estimation


So far, very little has been said about S except that it exists and is positive 
definite. The latter is the matrix generalization of the requirement that a scalar 
variance be positive. It is important that the estimator also exhibits this prop- 
erty or is positive semi-definite at the very least; otherwise the estimated vari- 
ances of the individual coefficient estimators can be negative. This is not always 
such a trivial property to impose and is one aspect of the various estimators 
upon which we focus below. 

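To understand more about the structure of $S$, it is useful to rewrite its
definition as follows,


$$\begin{aligned}
S &= \lim_{T\to\infty} Var\Big[T^{-1/2}\sum_{t=1}^T f_t\Big] \\
  &= \lim_{T\to\infty} E\Big[\Big(T^{-1/2}\sum_{t=1}^T f_t - E\Big[T^{-1/2}\sum_{t=1}^T f_t\Big]\Big)\Big(T^{-1/2}\sum_{t=1}^T f_t - E\Big[T^{-1/2}\sum_{t=1}^T f_t\Big]\Big)'\Big]
\end{aligned}$$


where to simplify notation we have set $f_t = f(v_t, \theta_0)$. Since


$$T^{-1/2}\sum_{t=1}^T f_t - E\Big[T^{-1/2}\sum_{t=1}^T f_t\Big] = T^{-1/2}\sum_{t=1}^T (f_t - E[f_t])$$

it follows that

$$\begin{aligned}
S &= \lim_{T\to\infty} E\Big[\Big\{T^{-1/2}\sum_{t=1}^T (f_t - E[f_t])\Big\}\Big\{T^{-1/2}\sum_{s=1}^T (f_s - E[f_s])\Big\}'\Big] \\
  &= \lim_{T\to\infty} E\Big[T^{-1}\sum_{t=1}^T\sum_{s=1}^T (f_t - E[f_t])(f_s - E[f_s])'\Big]
\end{aligned}$$   (3.37)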

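The stationarity assumption implies that $E[(f_t - E[f_t])(f_{t-j} - E[f_{t-j}])'] = \Gamma_j$,
say, for every $t$ and so³⁵


$$S = \Gamma_0 + \lim_{T\to\infty}\sum_{j=1}^{T-1}\left(\frac{T-j}{T}\right)(\Gamma_j + \Gamma_j') = \Gamma_0 + \sum_{i=1}^{\infty}(\Gamma_i + \Gamma_i')$$   (3.38)


The matrix $\Gamma_j$ is known as the $j^{th}$ autocovariance matrix³⁶ of $f_t$. From (3.38)
it is clear that estimation of $S$ is going to require assumptions about these
autocovariance matrices.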


35 For example, see Hamilton (1994) [pp. 279-80]. 
36 See Hamilton (1994) [pp.261-2] for a discussion of the properties of autocovariance 
matrices. 
The long run covariance matrix estimation literature has focused on ways
to avoid any potential inconsistency caused by inappropriate assumptions about
the dynamic specification of $\{f(v_t, \theta_0)\}$. Therefore, nearly all the contributions
to this literature develop the properties of the estimator in question under the
assumption that the model is correctly specified and so $E[f(v_t, \theta_0)] = 0$. We
maintain this assumption throughout the section. However, it should be noted
that if this assumption is inappropriate then all the estimators discussed below
are inconsistent. In other words, the consistency of a covariance matrix estimator
depends on the validity of the assumptions about both the mean and dynamic
structure of $\{f(v_t, \theta_0)\}$. It might be felt that little concern need be attached
to any inconsistency caused by $E[f(v_t, \theta_0)] \neq 0$ because once it is recognized
that the model is misspecified then there is typically no interest in constructing
confidence intervals for $\theta_0$. However, the use of an inconsistent covariance
matrix estimator has a detrimental effect on the properties of certain tests for
misspecification, and may in turn affect the properties of moment selection
procedures based upon these tests.³⁷ This motivates the use of covariance matrix
estimators which are consistent even if the model is misspecified. Fortunately,
there is a simple way to modify the estimators discussed here to achieve that
end. However, we delay further discussion of this topic until Section 4.3.

In this section we describe estimators which have been proposed under three
different sets of assumptions about the dynamic structure of $f_t$. The first is
where $\{f_t\}$ forms a serially uncorrelated sequence. This type of restriction
occurs in some of the models listed in Table 1.1 and so this case is treated
separately in Section 3.5.1. The remainder of the section considers the more general
case in which $f_t$ is serially correlated. Two main approaches have been taken.
The first assumes that $f_t$ is generated by a vector autoregressive moving average
(VARMA) process and is reviewed in Section 3.5.2. This approach has the
advantage that the autocovariances can be estimated straightforwardly from the
parameters of the VARMA model. The potential disadvantage is that if this
model for $f_t$ is incorrect then the resulting estimator of $S$ may be inconsistent.
The second approach uses a member of the class of heteroscedasticity and
autocorrelation covariance (HAC) matrix estimators and these are described in
Section 3.5.3. These estimators are consistent under much weaker conditions
on $\{f_t\}$. Unfortunately, these more general estimators can exhibit poor finite
sample performance and this prompted the construction of prewhitened and
recoloured HAC estimators. Initial evidence suggests this latter version performs
better and so it is also described in Section 3.5.3.

Our discussion of covariance matrix estimation is less rigorous than the
analysis in the previous sections. Instead we focus on the intuition behind
the various methods and describe both their strengths and weaknesses. All
the estimators can be established to be consistent under appropriate conditions
but the reproduction of these very technical results is beyond the scope of this
text. Instead, we refer the interested reader to the appropriate sources for a
catalogue of the required regularity conditions and rigorous proofs of the stated
results. As we shall see, there is plenty to discuss even without this more formal
analysis!

37 See Chapters 5 and 7 respectively.


3.5.1 Serially Uncorrelated Sequences 


If $\{f_t\}$ is a serially uncorrelated sequence then $\Gamma_j = 0$ for $j \neq 0$ and so it follows
from (3.38) that $S$ is given by


$$S = S_{SU} = E[f_tf_t']$$   (3.39)


where we have used the SU subscript to distinguish this $S$ from the cases
considered below. The form of $S_{SU}$ is essentially the same as the $S$ matrix in (2.27)
and a similar logic leads to the estimator³⁸


$$\hat S_{SU} = T^{-1}\sum_{t=1}^T \hat f_t\hat f_t'$$   (3.40)

where $\hat f_t = f(v_t, \hat\theta_T)$. It can be shown that $\hat S_{SU} \xrightarrow{p} S_{SU}$; e.g. see White
(1994) [Theorem 8.27, p.193]. Notice that this estimator is positive semi-definite
by construction because

$$\hat S_{SU} = T^{-1}H'H$$   (3.41)

where $H$ is the $(T \times q)$ matrix with $t^{th}$ row $\hat f_t'$.

In the types of models in Table 1.1, this type of behaviour occurs because the
underlying theory implies $\{f_t\}$ is a martingale difference sequence with respect
to the information set $\Omega_{t-1} = \{f_{t-1}, f_{t-2}, \ldots f_1\}$. Such a process satisfies both
$E[f_t] = 0$ for all $t$ and also


$$E[f_t|\Omega_{t-1}] = 0 \quad \text{for } t = 2,3\ldots$$   (3.42)

Consequently, for $t > s$, we have $E[f_tf_s'|\Omega_{t-1}] = E[f_t|\Omega_{t-1}]f_s' = 0$ which implies


$$E[f_tf_s'] = E[E[f_t|\Omega_{t-1}]f_s'] = 0$$   (3.43)
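
As a minimal sketch of (3.40) and (3.41), the function below computes $\hat S_{SU}$ from a $(T \times q)$ array whose rows play the role of $\hat f_t'$. The function name `s_su` and the simulated data are assumptions made purely for illustration.

```python
import numpy as np


def s_su(f_hat):
    """(3.40)-(3.41): T^{-1} H'H, where the rows of H are f_hat_t'."""
    H = np.asarray(f_hat, dtype=float)
    return H.T @ H / H.shape[0]


rng = np.random.default_rng(2)
H = rng.normal(size=(200, 3))      # simulated stand-in for the moment functions
S_su = s_su(H)

# Same quantity written as the average of outer products, as in (3.40)
S_outer = sum(np.outer(h, h) for h in H) / H.shape[0]
```

Writing the estimator as $T^{-1}H'H$ makes the positive semi-definiteness visible: for any vector $c$, $c'H'Hc$ is a sum of squares.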


3.5.2 VARMA Processes 


If $f_t$ is generated by a stationary and invertible vector autoregressive moving
average (VARMA) model of order $(m,n)$ and $E[f_t] = 0$, then it has the following
representation³⁹

$$\Psi(L)f_t = \Theta(L)e_t$$   (3.44)


in which $\{e_t\}$ is a sequence of independently and identically distributed random
vectors with $E[e_t] = 0$ and $Var[e_t] = \Sigma$. The $(q \times q)$ matrix polynomials, $\Psi(L)$
and $\Theta(L)$ are respectively of orders $m$ and $n$; $\Psi(L)$ contains the autoregressive
parameters of the system and $\Theta(L)$, the moving average parameters. The
restrictions on the parameters implied by the terms “stationary and invertible”
are important for our discussion and so worth a brief explanation. A VARMA
process is stationary if the roots, $\{s_i,\ i = 1,2,\ldots m\}$, of the characteristic
equation $\det\{\Psi(s)\} = 0$ are all outside the unit circle. This implies that $f_t$ has a
VMA($\infty$) representation,⁴⁰


$$f_t = \{\Psi(L)\}^{-1}\Theta(L)e_t$$   (3.45)


The process is invertible if the roots, $\{s_i^*,\ i = 1,2,\ldots n\}$, of the characteristic
equation $\det\{\Theta(s^*)\} = 0$ are all outside the unit circle. This implies $f_t$ has a
VAR($\infty$) representation,⁴¹


$$\{\Theta(L)\}^{-1}\Psi(L)f_t = A(L)f_t = e_t$$   (3.46)


where $A(L) = I_q - A_1L - A_2L^2 - \ldots$ is a $(q \times q)$ matrix polynomial of infinite
order.

38 Also see Section 4.3.
39 See Hamilton (1994) [Chapters 10 and 11] for an introduction to vector time series models
and Reinsel (1993) for a more elaborate discussion of VARMA models.

Now let us return to the construction of a consistent estimator for $S$. From
(3.45) it follows that⁴²


$$S = S_{VARMA} = \{\Psi(1)\}^{-1}\Theta(1)\Sigma\Theta(1)'\{\Psi(1)'\}^{-1}$$   (3.47)


where $\Psi(1) = I_q + \sum_{i=1}^m \Psi_i$ and $\Theta(1) = I_q + \sum_{i=1}^n \Theta_i$. This matrix can be
consistently estimated by


$$\hat S_{VARMA} = \{\hat\Psi(1)\}^{-1}\hat\Theta(1)\hat\Sigma\hat\Theta(1)'\{\hat\Psi(1)'\}^{-1}$$   (3.48)

where $\hat\Psi(1) = I_q + \sum_{i=1}^m \hat\Psi_i$, $\hat\Theta(1) = I_q + \sum_{i=1}^n \hat\Theta_i$ and
$\{\hat\Sigma, \hat\Psi_i, \hat\Theta_j;\ i = 1,2,\ldots m;\ j = 1,2,\ldots n\}$ are consistent estimators of
$\{\Sigma, \Psi_i, \Theta_j;\ i = 1,2,\ldots m;\ j = 1,2,\ldots n\}$. Since $f_t$ is unobserved, these
parameter estimates are obtained by estimating a VARMA model for $\hat f_t$. The
estimator of $\Sigma$ is of the form $\hat\Sigma = T^{-1}\sum_{t=1}^T \hat e_t\hat e_t'$ and so $\hat S_{VARMA}$ is
positive semi-definite by construction. The
estimation of VARMA models can be performed using generalized least squares
or maximum likelihood; see Reinsel (1993) [Chapter 5]. However, it is
computationally burdensome due to the presence of the MA terms. Various methods
have been proposed for circumventing this problem in the context of covariance
matrix estimation. Eichenbaum, Hansen, and Singleton (1988) and West (1997)
suggest methods which can be employed if $f_t$ follows a VARMA(0,n) process.
Although the absence of the autoregressive component can be justified in some
of the models listed in Table 1.1, we do not review these procedures here.
Instead, we focus on a more general method proposed by den Haan and Levin
(1996) which can be applied when $f_t$ follows a VARMA(m,n) process.⁴³


40 See Hamilton (1994) [pp. 259-61]. 
41 Ibid. [p. 263].

42 Ibid. [pp. 276-84]. 

43 Also see den Haan and Levin (1997). 
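
Formula (3.47) is straightforward to evaluate once the VARMA coefficients are given. The sketch below follows the sign convention stated in the text, $\Psi(1) = I_q + \sum_i \Psi_i$ and $\Theta(1) = I_q + \sum_j \Theta_j$; the function name `s_varma` and the scalar ARMA(1,1) example are hypothetical.

```python
import numpy as np


def s_varma(psi, theta, sigma):
    """Long run covariance (3.47) from lists of AR and MA coefficient matrices."""
    sigma = np.atleast_2d(np.asarray(sigma, dtype=float))
    q = sigma.shape[0]
    # Psi(1) = I_q + sum_i Psi_i and Theta(1) = I_q + sum_j Theta_j
    psi1 = np.eye(q) + sum((np.atleast_2d(m) for m in psi), np.zeros((q, q)))
    theta1 = np.eye(q) + sum((np.atleast_2d(m) for m in theta), np.zeros((q, q)))
    psi1_inv = np.linalg.inv(psi1)
    return psi1_inv @ theta1 @ sigma @ theta1.T @ psi1_inv.T


# Scalar example: f_t = 0.5 f_{t-1} + e_t + 0.4 e_{t-1} with Var[e_t] = 1.
# Under the convention Psi(L) = 1 + Psi_1 L, this corresponds to Psi_1 = -0.5.
phi, ma, var_e = 0.5, 0.4, 1.0
S = s_varma(psi=[-phi], theta=[ma], sigma=var_e)

# Cross-check against Gamma_0 + 2 * sum_{j>=1} Gamma_j for an ARMA(1,1)
gamma0 = var_e * (1 + 2 * phi * ma + ma**2) / (1 - phi**2)
gamma1 = var_e * (1 + phi * ma) * (phi + ma) / (1 - phi**2)
S_from_autocov = gamma0 + 2 * gamma1 / (1 - phi)
```

Both routes give $(1.4)^2/(0.5)^2 = 7.84$ in this example, which illustrates why the long run variance can be read directly from the model parameters when the VARMA specification is correct.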
To motivate den Haan and Levin’s method, it is useful to rewrite $S_{VARMA}$
in terms of the coefficients in the VAR($\infty$) representation. From (3.46) it follows
that

$$S_{VARMA} = \{A(1)\}^{-1}\Sigma\{A(1)'\}^{-1}$$   (3.49)

This suggests an alternative approach: estimate $S$ using the coefficients
from the VAR($\infty$) representation and thereby avoid the computational problems
associated with the estimation of MA terms. There is just one snag: it
is impossible to estimate an infinite order autoregressive model from a finite
sample. To circumvent this problem, den Haan and Levin (1996) propose
approximating (3.46) by a finite order VAR model whose order increases with the
sample size. To implement this method in practice, it is necessary to choose the
order of this approximation. Den Haan and Levin (1996) recommend that this choice
be made via a data-based model selection criterion. Specifically, they propose
the following method for the estimation of $S_{VARMA}$:⁴⁴


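Den Haan and Levin’s Method


1. Calculate $\hat\Sigma(0) = T^{-1}\sum_{t=1}^T \hat f_t\hat f_t'$.


2. Estimate the model


$$\hat f_t = A_1(k)\hat f_{t-1} + \ldots + A_k(k)\hat f_{t-k} + e_t(k)$$   (3.50)


for $k = 1,2,\ldots K$ and $t = K+1, K+2 \ldots T$ by least squares where
$\hat f_t = f(v_t, \hat\theta_T)$. These estimates are given by


$$\hat A(k) = \sum_{t=K+1}^T \hat f_tr_t'\Big[\sum_{t=K+1}^T r_tr_t'\Big]^{-1}$$


where $\hat A(k) = (\hat A_1(k), \hat A_2(k), \ldots \hat A_k(k))$ and $r_t = (\hat f_{t-1}', \hat f_{t-2}', \ldots \hat f_{t-k}')'$.
Construct the forecast error $\hat e_t(k) = \hat f_t - \hat A(k)r_t$ and
$\hat\Sigma(k) = T^{-1}\sum_{t=K+1}^T \hat e_t(k)\hat e_t(k)'$.

3. Let $\hat k$ be the value of $k$ which minimizes Schwarz’s (1978) information
criterion

$$SIC(k) = \log\{\det[\hat\Sigma(k)]\} + \frac{\log(T)kq^2}{T}$$   (3.51)

over $k = 0,1,\ldots K$.


4. Estimate $S_{VARMA}$ by


$$\hat S_{VARMA} = \Big\{I_q - \sum_{i=1}^{\hat k}\hat A_i(\hat k)\Big\}^{-1}\hat\Sigma(\hat k)\Big\{I_q - \sum_{i=1}^{\hat k}\hat A_i(\hat k)'\Big\}^{-1}$$   (3.52)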


44 Also see Section 4.2 
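
The four steps above can be sketched as follows. This is an illustrative implementation under stated assumptions rather than den Haan and Levin's own code: the function names `var_fit` and `den_haan_levin` are hypothetical, the input `f` is a $(T \times q)$ array whose rows are $\hat f_t'$, and the simulated AR(1) series is used only to exercise the code.

```python
import numpy as np


def var_fit(f, k, K):
    """Least squares VAR(k) over t = K+1..T; returns (A_hat, Sigma_hat)."""
    T, q = f.shape
    if k == 0:
        return np.zeros((q, 0)), f.T @ f / T               # step 1: Sigma_hat(0)
    Y = f[K:]                                               # f_t for t = K+1..T
    R = np.hstack([f[K - i:T - i] for i in range(1, k + 1)])  # rows are r_t'
    A = np.linalg.solve(R.T @ R, R.T @ Y).T                 # sum f r' [sum r r']^{-1}
    E = Y - R @ A.T                                         # forecast errors e_hat_t(k)
    return A, E.T @ E / T


def den_haan_levin(f, K):
    """Steps 1-4: choose k by SIC (3.51), then apply (3.52)."""
    T, q = f.shape
    best = None
    for k in range(K + 1):
        A, Sigma = var_fit(f, k, K)
        sic = np.linalg.slogdet(Sigma)[1] + np.log(T) * k * q**2 / T
        if best is None or sic < best[0]:
            best = (sic, k, A, Sigma)
    _, k_hat, A, Sigma = best
    # Sum of the selected autoregressive coefficient matrices
    A_sum = sum((A[:, i * q:(i + 1) * q] for i in range(k_hat)), np.zeros((q, q)))
    B = np.linalg.inv(np.eye(q) - A_sum)
    return B @ Sigma @ B.T, k_hat


# Simulated scalar AR(1) with coefficient 0.5 and unit innovation variance,
# for which the long run variance is 1 / (1 - 0.5)^2 = 4
rng = np.random.default_rng(3)
T = 2000
e = rng.normal(size=T)
f = np.zeros((T, 1))
for t in range(1, T):
    f[t, 0] = 0.5 * f[t - 1, 0] + e[t]
S_hat, k_hat = den_haan_levin(f, K=4)
```

The estimate has the form $B\hat\Sigma B'$, so it inherits positive semi-definiteness from $\hat\Sigma(\hat k)$ whatever lag length is selected.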
To implement this method, it is necessary to choose $K$. Den Haan and Levin
(1996) show $\hat S_{VARMA}$ is consistent provided $K \to \infty$ as $T \to \infty$ and $K =
O(T^{1/3})$ but an appropriate rule for picking $K$ in finite samples remains an
open question.⁴⁵ The use of SIC has the advantage that the lag selection procedure is
consistent: if $n > 0$ then $\hat k$ tends in probability to $\infty$ as $T \to \infty$, but
if $n = 0$ then $\hat k \xrightarrow{p} m$.⁴⁶ Finally, notice that once again this covariance matrix
estimator is positive semi-definite by construction.

We have motivated this estimator by assuming $f_t$ satisfies a VARMA model.
However, inspection of den Haan and Levin’s method indicates that it is
consistent provided the autocovariance structure of $f_t$ is equivalent to that of some
infinite order autoregression. For this, it is only sufficient and not necessary that
$f_t$ be a VARMA process. Den Haan and Levin provide a set of more general
conditions under which the estimator is consistent. These conditions are very
similar to those employed in the next section and certain parallels will emerge
between $\hat S_{VARMA}$ and some of the methods to which we now turn.


3.5.3 Heteroscedasticity and Autocorrelation Covariance 
Matrix Estimators 


Unfortunately, VARMA processes may not be sufficiently general to capture the
dependence structure of $f_t$ in all cases of interest. This has prompted the
development of the class of heteroscedasticity and autocorrelation covariance (HAC)
matrices which are consistent under relatively weak assumptions on the
dependence structure of the process. However, it is necessary to impose some further
restrictions beyond those already assumed in Section 3.4 for the asymptotic
analysis. The discussion in this section rests mostly on the work of Andrews
(1991) and Newey and West (1994), and these authors catalogue the required
regularity conditions.⁴⁷

To motivate these estimators, it is useful to return to the definition of $S$ given
in (3.38), namely $S = \Gamma_0 + \sum_{i=1}^{\infty}(\Gamma_i + \Gamma_i')$. Given this structure, it is natural to
estimate $S$ by truncating this infinite sum and using the sample autocovariances,
$\hat\Gamma_j = T^{-1}\sum_{t=j+1}^T \hat f_t\hat f_{t-j}'$, as estimates of their population analogs. This leads
to the estimator

$$\hat S_{TR} = \hat\Gamma_0 + \sum_{i=1}^{\ell_T}(\hat\Gamma_i + \hat\Gamma_i')$$   (3.53)

where “TR” stands for truncated. White and Domowitz (1984) first proposed
this type of estimator and showed its consistency in certain least squares settings
provided $\ell_T \to \infty$ as $T \to \infty$ and $\ell_T = o(T^{1/3})$. This would appear to solve
the problem, but does not. While $\hat S_{TR}$ converges in probability to a positive
definite matrix, it may be indefinite in finite samples.


45 The asymptotic theory is satisfied by the closest integer to $cT^{1/3}$ for any finite positive
constant $c$.
46 Den Haan and Levin (1996) also consider using Akaike’s (1973) information criterion,
$AIC(k) = \log\{\det[\hat\Sigma(k)]\} + 2kq^2/T$, to pick the lag length. Their theoretical analysis suggests that
SIC is a better choice because AIC is not a consistent method of lag selection; however their
limited simulation evidence suggests that the two criteria perform comparably in this context.

47 These include the conditions: (i) $\sup_{\theta \in N_\epsilon} E[\|\partial^2 f(v_t, \theta)/\partial\theta_i\partial\theta_j\|] < \infty$ for $i, j = 1,2,\ldots p$
where $N_\epsilon$ is some neighbourhood of $\theta_0$; (ii) $(f(v_t, \theta_0)', vec(\partial f(v_t, \theta_0)/\partial\theta' - E[\partial f(v_t, \theta_0)/\partial\theta'])')'$
has $l$-summable autocovariances and absolutely summable fourth order cumulants, where $l$ is
some positive constant.


The source of the trouble is not the truncation but the weights given to
the sample autocovariances in (3.53). This is most readily seen by restricting
attention to the case where $f_t$ is an $\ell$-dependent process so that $\Gamma_i = 0$ for all
$i > \ell$, and $\ell_T = \ell$. In this case, the correct order of the process is being used in
the estimator but the estimator is still not positive semi-definite. This failure
is uncovered by rewriting $\hat S_{TR}$ as


$$\hat S_{TR} = T^{-1}H'DH$$


where $H$ is the same matrix as in (3.41) and $D$ is the $(T \times T)$ matrix whose
only non-zero elements are $D_{i,j} = 1$ for $j = s_1(i), \ldots s_2(i)$ for $i = 1,2,\ldots T$
and $s_1(i) = \max(i - \ell, 1)$, $s_2(i) = \min(i + \ell, T)$. Since $D$ is not positive
semi-definite, neither is $\hat S_{TR}$. It is important to realize that the failure of positive
semi-definiteness does not always imply negative sample variances. Rather it
means that negative variances can occur for certain realizations of $H$. In the
limit, the problem disappears because all realizations from the process must
satisfy $\hat S_{TR} \xrightarrow{p} \Gamma_0 + \sum_{i=1}^{\ell}(\Gamma_i + \Gamma_i')$ which is positive definite by definition.
Another important aspect of the problem can be learnt from this example. If $\ell = 0$
then $\hat S_{TR} = \hat S_{SU}$ and this estimator is positive semi-definite by construction.
So the problem stems from the inclusion of the sample autocovariance matrices
$\{\hat\Gamma_i;\ i = 1,2,\ldots \ell\}$.

The solution is to construct an estimator in which the contribution of the
sample autocovariance matrices is weighted to downgrade their role sufficiently
in finite samples to ensure positive semi-definiteness but have the weights
tend to one as $T \to \infty$ to ensure consistency. This is the intuition behind the
class of heteroscedasticity autocorrelation covariance (HAC) matrices. This class
consists of estimators of the form


$$\hat S_{HAC} = \hat\Gamma_0 + \sum_{i=1}^{T-1}\omega_{i,T}(\hat\Gamma_i + \hat\Gamma_i')$$   (3.54)


where $\omega_{i,T}$ is known as the kernel (or weight). The kernel must be carefully chosen
to ensure the twin properties of consistency and positive semi-definiteness.
The three most popular choices in the econometrics literature are given in Table
3.3.
Table 3.3
Kernels for three common HAC estimators


Name                 Author(s)                Kernel, $\omega_{i,T}$

Bartlett             Newey and West (1987a)   $1 - a_i$ for $a_i \le 1$
                                              $0$ for $a_i > 1$
Parzen               Gallant (1987)           $1 - 6a_i^2 + 6a_i^3$ for $0 \le a_i \le 0.5$
                                              $2(1 - a_i)^3$ for $0.5 < a_i \le 1$
                                              $0$ for $a_i > 1$
Quadratic Spectral   Andrews (1991)           $\frac{25}{12\pi^2 d_i^2}\left[\frac{\sin(m_i)}{m_i} - \cos(m_i)\right]$


Note: $a_i = i/(b_T + 1)$; $d_i = i/b_T$; $m_i = 6\pi d_i/5$.
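
The kernels in Table 3.3 and the estimator (3.54) translate directly into code. The sketch below is a minimal illustration: the function names are hypothetical, the bandwidth `b` is supplied by the user (and must be strictly positive for the quadratic spectral kernel), and `f` is a $(T \times q)$ array whose rows are $\hat f_t'$.

```python
import numpy as np


def bartlett(i, b):
    a = i / (b + 1.0)
    return max(1.0 - a, 0.0)


def parzen(i, b):
    a = i / (b + 1.0)
    if a <= 0.5:
        return 1.0 - 6.0 * a**2 + 6.0 * a**3
    if a <= 1.0:
        return 2.0 * (1.0 - a) ** 3
    return 0.0


def quadratic_spectral(i, b):
    d = i / b
    m = 6.0 * np.pi * d / 5.0
    return 25.0 / (12.0 * np.pi**2 * d**2) * (np.sin(m) / m - np.cos(m))


def s_hac(f, b, kernel=bartlett):
    """(3.54): Gamma_hat_0 + sum_{i=1}^{T-1} w_{i,T} (Gamma_hat_i + Gamma_hat_i')."""
    T = f.shape[0]
    S = f.T @ f / T
    for i in range(1, T):
        w = kernel(i, b)
        if w == 0.0:
            continue                      # lag receives zero weight
        G = f[i:].T @ f[:T - i] / T       # sample autocovariance at lag i
        S = S + w * (G + G.T)
    return S


# Alternating scalar series with T = 10 and Bartlett bandwidth b = 1
H = np.array([[(-1.0) ** t] for t in range(1, 11)])
S_bartlett = s_hac(H, b=1)
```

For this alternating series the Bartlett weight on the first autocovariance is 0.5, so the estimate is $1 + 0.5 \times (-1.8) = 0.1$, which is non-negative even though giving that autocovariance a weight of one would produce $-0.8$.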


Here “name” refers to the term by which the particular choice of kernel is most
commonly known, and is a reference back to an earlier literature on the
estimation of the spectral density at frequency zero in which these types of problems
were first solved.⁴⁸ The parameter $b_T$ is known as the bandwidth, and must
be non-negative. Notice that this parameter controls the number of
autocovariances included in the HAC estimator when either the Bartlett or Parzen kernels
are used. In these two cases, $b_T$ must be an integer, but no such restriction is
required for the quadratic spectral kernel. Which set of weights should be used?
Andrews (1991) shows that the Quadratic Spectral weights are optimal in the
sense that they minimize an asymptotic mean squared error criterion for the
estimation of $S$. His results imply that this choice only marginally dominates
the Parzen weights, but both should be much better than the Bartlett weights.
This is mirrored to some extent by his simulation results for a linear model
with some simple forms of autocorrelation and heteroscedasticity. However,
although the Quadratic Spectral weights perform slightly better than the Parzen
weights, neither dominates the Bartlett weights to the extent predicted by the
theory. Newey and West (1994) report simulation evidence from two more
general linear models; in one, their results corroborate Andrews’s but in the other
they find no clear ranking is possible. Newey and West (1994) conclude that
the choice between the kernels is not particularly important; a view for which
there is some precedent in the earlier spectral density estimation literature.⁴⁹

The bandwidth is a much more important determinant of the finite
sample properties of $\hat S_{HAC}$. For consistency, $b_T$ must tend to infinity with $T$.⁵⁰
Andrews (1991) shows that the asymptotic mean square error is minimized by
setting $b_T$ equal to $O(T^{1/3})$ for the Bartlett weights and $O(T^{1/5})$ for both the
Parzen and Quadratic Spectral weights. However, again, this type of condition
provides little practical guidance because it only restricts the optimal
bandwidth for the Bartlett weights, say, to be of the form $cT^{1/3}$ for any choice of
finite $c > 0$. Andrews (1991) develops some procedures for picking the optimal $c$
based on the assumption that $f_t$ follows certain VARMA models. However, we
do not pursue these here because if this specification is adopted then it seems
more reasonable to use the $\hat S_{VARMA}$ described in the previous section.⁵¹ Newey
and West (1994) propose a nonparametric method for selecting the bandwidth
and show it minimizes the asymptotic mean square error criterion. The
mechanics of this approach are as follows; the parameters $(h, n, c_\gamma, \nu)$ are defined
afterwards.


48 See Priestley (1981) for a review of this earlier literature.

49 See Priestley (1981) [p.574].

50 Newey and West (1987a) and Gallant (1987) prove the consistency of their particular
estimators under the assumption $b_T = o(T^{1/4})$. Andrews (1991) and Hansen (1992) prove
the consistency of this general class of estimators under the assumption $b_T = o(T^{1/2})$.


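Newey and West’s Method of Bandwidth Selection


1. Use the $(q \times 1)$ vector $h$ to construct the scalar random variable $\hat c_t = h'\hat f_t$.

2. Construct $\hat\sigma_j = T^{-1}\sum_{t=j+1}^T \hat c_t\hat c_{t-j}$ for $j = 0,1,\ldots n$.

3. Calculate $\hat s^{(\nu)} = 2\sum_{j=1}^n j^{\nu}\hat\sigma_j$ and $\hat s^{(0)} = \hat\sigma_0 + 2\sum_{j=1}^n \hat\sigma_j$.

4. Calculate $\hat\gamma = c_\gamma\{(\hat s^{(\nu)}/\hat s^{(0)})^2\}^{1/(2\nu+1)}$.

5. For the Bartlett and Parzen kernels, set $b_T = int\{\hat\gamma T^{1/(2\nu+1)}\}$ where
int{.} denotes the integer part of the number inside the brackets; for the
Quadratic Spectral kernel, set $b_T = \hat\gamma T^{1/(2\nu+1)}$.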


It would be anticipated that the bandwidth depends on the autocovariances of
$\hat f_t$ and close inspection of the above reveals this to be the case. However, there
is no simple intuition for the exact nature of the calculations. The parameters
$(n, c_\gamma, \nu)$ are given in Table 3.4.


Table 3.4
Parameter values for Newey and West’s (1994)
bandwidth selection method


Weight               $\nu$   $n$             $c_\gamma$
Bartlett             1       $O(T^{2/9})$    1.1447
Parzen               2       $O(T^{4/25})$   2.6614
Quadratic Spectral   2       $O(T^{2/25})$   1.3221


Notice that the exact choice of n is not specified and so Newey and West’s pro- 
cedure does not completely solve the problem. They recommend that the calcu- 
lations be repeated for different choices of n to ensure the resulting confidence 
intervals or hypothesis tests are not sensitive to the choice of this parameter. 


51 If $f_t$ follows a VARMA process and so $S = S_{VARMA}$ then $\hat S_{VARMA}$ converges to this
limit faster than $\hat S_{HAC}$; see Andrews (1991) and den Haan and Levin (1996).
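
The five steps of the bandwidth selection method can be sketched as a single function. The name `newey_west_bandwidth` and its signature are hypothetical; the user supplies $h$, $n$, $\nu$ and $c_\gamma$, since the text leaves the exact choice of $n$ open. The example uses the Parzen values $\nu = 2$ and $c_\gamma = 2.6614$ on a tiny artificial series so that each step can be traced by hand.

```python
import numpy as np


def newey_west_bandwidth(f, h, n, nu, c_gamma, integer=True):
    """Data-based bandwidth following steps 1-5 of the method above.

    f is a (T x q) array with rows f_hat_t', h a vector of length q.
    Set integer=False for the quadratic spectral kernel.
    """
    T = f.shape[0]
    c = f @ h                                                        # step 1
    sigma = np.array([c[j:] @ c[:T - j] / T for j in range(n + 1)])  # step 2
    lags = np.arange(1, n + 1)
    s_nu = 2.0 * np.sum(lags**nu * sigma[1:])                        # step 3
    s_0 = sigma[0] + 2.0 * np.sum(sigma[1:])
    gamma = c_gamma * ((s_nu / s_0) ** 2) ** (1.0 / (2 * nu + 1))    # step 4
    b = gamma * T ** (1.0 / (2 * nu + 1))                            # step 5
    return int(b) if integer else b


# Artificial example: T = 4, constant series, n = 1, Parzen parameters
f = np.ones((4, 1))
h = np.array([1.0])
b_parzen = newey_west_bandwidth(f, h, n=1, nu=2, c_gamma=2.6614)
b_real = newey_west_bandwidth(f, h, n=1, nu=2, c_gamma=2.6614, integer=False)
```

In this example $\hat\sigma_0 = 1$ and $\hat\sigma_1 = 0.75$, so $\hat s^{(2)} = 1.5$ and $\hat s^{(0)} = 2.5$; the real-valued bandwidth is roughly 2.86 and its integer part is 2. In applications the calculation would be repeated for several values of `n`, as recommended in the text.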
To implement the method, the vector $h$ must also be chosen. Newey and West
(1994) focus on the case where $f_t = z_tu_t$ and suggest that if the first element
of $z_t$ is a constant then $h$ can be set equal to (0,1,1,...1). More generally,
the choice of $h$ can be data dependent subject to certain conditions; see Newey
and West (1994) [p.636]. However to date, no further guidance is available
about either how this choice should be made or its impact on the finite sample
properties of the covariance matrix estimator.

In theory, the HAC estimators have solved the problem of constructing a
consistent, positive semi-definite estimator of $S$ under very weak conditions on
$f_t$. However, in practice, they often do not work well in cases of interest.
Simulation evidence suggests their use can lead to confidence intervals of the form (3.27)
which do not possess the anticipated coverage rates in finite samples; see
Andrews (1991), Andrews and Monahan (1992) and Newey and West (1994). An
examination of the estimation error indicates the types of circumstance in which
this problem may be present. For ease of exposition, we restrict attention to
HAC estimators for which $\omega_{i,T} = 0$ for $i > b_T$. From (3.38) and (3.54), the
estimation error is


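$$\begin{aligned}
S - \hat S_{HAC} = {} & \{\Gamma_0 - \hat\Gamma_0\} + \sum_{i=1}^{b_T}\omega_{i,T}\{(\Gamma_i - \hat\Gamma_i) + (\Gamma_i' - \hat\Gamma_i')\} \\
 & + \sum_{i=1}^{b_T}(1 - \omega_{i,T})(\Gamma_i + \Gamma_i') + \sum_{i=b_T+1}^{\infty}(\Gamma_i + \Gamma_i')
\end{aligned}$$   (3.55)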


So there are three sources of error: (i) error from the estimation of the
autocovariances, $\{\Gamma_i - \hat\Gamma_i\}$; (ii) error due to the weights on the estimated
autocovariances, $1 - \omega_{i,T}$; (iii) approximation error due to the truncation of the sum,
$\sum_{i=b_T+1}^{\infty}(\Gamma_i + \Gamma_i')$. The best way to appreciate when these errors are large is to
start by describing a situation in which they should be relatively small. Suppose
$f_t$ is an $\ell$-dependent process and $\ell$ is small relative to $b_T$. In this case it follows
that: (i) the weights on the $\hat\Gamma_i$ for $i \le \ell$ are very close to one; (ii) for $i > \ell$ the
weights help to shrink the estimated covariance matrices towards their limiting
value of zero; (iii) there is no approximation error. These three effects combine
to produce an estimator that is reasonably accurate in finite samples. Now
consider what happens as $\ell$ increases. Estimation error creeps in because the
weights are substantially different from one for the longer lags less than or equal
to $\ell$ and then once $b_T < \ell$, there is approximation error as well. This suggests
that $\hat S_{HAC}$ is unlikely to perform well in finite samples if the population
autocovariance matrices of $f_t$ die out too slowly. Such behaviour would be observed
if $f_t$ is generated by a process with a substantial autoregressive component.

Autoregressive behaviour is a common feature of economic time series and
so these problems motivated Andrews and Monahan (1992) to propose a
modification to the HAC estimator based on a technique called prewhitening and
recolouring.⁵² The basic idea is to filter $\hat f_t$ to reduce the size of its autoregressive
component and hence to produce a series for which an HAC estimator works
better. This is known as the “prewhitening” phase. The long run variance of
the filtered series is estimated using a member of the HAC class. Then in the
“recolouring” phase, the long run variance of $f_t$ is estimated from the HAC estimate and
the properties of the filter. Andrews and Monahan (1992) recommend using a
VAR(m) process to filter the data and so their procedure is as follows.


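Andrews and Monahan’s Procedure


1. Estimate the VAR(m) model for $\hat f_t$,

$$\hat f_t = A_1(m)\hat f_{t-1} + \ldots + A_m(m)\hat f_{t-m} + e_t(m)$$   (3.56)


by least squares. These estimates are given by


$$\hat A(m) = \sum_{t=m+1}^T \hat f_tr_t'\Big[\sum_{t=m+1}^T r_tr_t'\Big]^{-1}$$


where $\hat A(m) = (\hat A_1(m), \hat A_2(m), \ldots \hat A_m(m))$ and $r_t = (\hat f_{t-1}', \hat f_{t-2}', \ldots \hat f_{t-m}')'$.
Construct the forecast error $\hat e_t(m) = \hat f_t - \hat A(m)r_t$.


2. Construct the estimator $\hat\Sigma = \hat\Gamma_0 + \sum_{i=1}^{T-1}\omega_{i,T}(\hat\Gamma_i + \hat\Gamma_i')$ where, in this step,
$\hat\Gamma_i = T^{-1}\sum_{t=m+i+1}^T \hat e_t(m)\hat e_{t-i}(m)'$.


3. The estimator of $S$ is


$$\hat S_{PWRC} = \Big\{I_q - \sum_{i=1}^m \hat A_i(m)\Big\}^{-1}\hat\Sigma\Big\{I_q - \sum_{i=1}^m \hat A_i(m)'\Big\}^{-1}$$   (3.57)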
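
The three steps can be sketched for the case $m = 1$ with Bartlett weights. This is an illustrative implementation under stated assumptions, not Andrews and Monahan's own code: the function names are hypothetical, the bandwidth `b` is supplied by the user, and the optional clipping of singular values at 0.97 anticipates the adjustment to the filter that the text describes after the procedure.

```python
import numpy as np


def bartlett_hac(e, b, T):
    """HAC estimator with Bartlett weights applied to the residual series e."""
    S = e.T @ e / T
    for i in range(1, b + 1):
        G = e[i:].T @ e[:-i] / T
        S = S + (1.0 - i / (b + 1.0)) * (G + G.T)
    return S


def clip_singular_values(A, bound=0.97):
    """Replace singular values of A above the bound by the bound itself."""
    B, delta, Ct = np.linalg.svd(A)
    return B @ np.diag(np.minimum(delta, bound)) @ Ct


def prewhitened_hac(f, b, bound=0.97):
    """Prewhitening with a VAR(1), HAC on the residuals, then recolouring (3.57)."""
    T, q = f.shape
    Y, R = f[1:], f[:-1]
    A = np.linalg.solve(R.T @ R, R.T @ Y).T     # step 1: least squares VAR(1)
    A = clip_singular_values(A, bound)          # keep the filter away from a unit root
    E = Y - R @ A.T                             # prewhitened series e_hat_t(1)
    Sigma = bartlett_hac(E, b, T)               # step 2
    D = np.linalg.inv(np.eye(q) - A)
    return D @ Sigma @ D.T                      # step 3: recolouring


# Simulated scalar AR(1) with coefficient 0.5, long run variance 4
rng = np.random.default_rng(5)
T = 2000
e = rng.normal(size=T)
f = np.zeros((T, 1))
for t in range(1, T):
    f[t, 0] = 0.5 * f[t - 1, 0] + e[t]
S_pw = prewhitened_hac(f, b=4)
```

Because the residuals are close to serially uncorrelated after filtering, the HAC step has little autocorrelation left to capture, and most of the long run variance is restored by the recolouring factor $\{I_q - \hat A\}^{-1}$.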


Any value of $m$ can be used; however, Newey and West (1994) recommend
using $\hat S_{PWRC}$ with $m = 1$ and their method of bandwidth selection in step
2. This estimator is positive semi-definite by construction and Andrews and
Monahan (1992) prove its consistency. There are clearly close parallels with
den Haan and Levin’s (1996) method: the main difference is that Andrews and
Monahan use the autoregressive filter to remove some of the autocorrelation
structure, whereas in den Haan and Levin’s method the autoregression
must remove all of it. This
difference manifests itself in the consistency proofs. The consistency of $\hat S_{PWRC}$
depends mostly on the use of the HAC estimator in step 2, but the filter must
also satisfy certain properties. In particular, if we write $plim_{T\to\infty}\hat A_i(m) =
A_i(m)$, then $A(L) = I_q - \sum_{i=1}^m A_i(m)L^i$ must satisfy the conditions for stationarity
presented in the previous section. Since den Haan and Levin’s method is based
essentially on the AR($\infty$) representation, this property is guaranteed in that
case.⁵³ It may not hold, however, if the AR polynomial is arbitrarily truncated
at some finite lag as is done in Andrews and Monahan’s procedure.
Since the AR filter is just a device to reduce the autocorrelation, and not to
remove it, Andrews and Monahan propose modifying the filter as follows to
ensure it satisfies the required “stationarity” condition.


52 Like the HAC estimators, this technique has its origins in the literature on spectral
density estimation where it was first proposed by Press and Tukey (1956).

To describe this modification, it is most convenient to set m = 1, which is
the choice recommended by Newey and West (1994). Since there is now only
one coefficient matrix, $\hat{A}_1(1)$, we denote this matrix by $\hat{A}$. Notice that
in this case, the condition for stationarity reduces to the requirement that the
eigenvalues of $\text{plim}_{T \to \infty} \hat{A} = A$ are less than one in absolute
value.⁵⁴ In practice, problems may occur if the eigenvalues of $\hat{A}$ satisfy this
condition but are close to one. Therefore, Andrews and Monahan (1992) propose
modifying $\hat{A}$ to ensure its eigenvalues do not exceed 0.97 in absolute value.
Their procedure is based on the Singular Value Decomposition of $\hat{A}$. This
decomposition is $\hat{A} = \hat{B} \hat{\Delta} \hat{C}'$ where $\hat{\Delta}$ is a diagonal matrix
whose elements are all non-negative.⁵⁵ Andrews and Monahan (1992) show that the
eigenvalues of $\hat{A}$ are guaranteed to satisfy the required constraint if all the
diagonal elements of $\hat{\Delta}$ are less than or equal to 0.97. If this is not the
case, then Andrews and Monahan (1992) recommend that the offending elements of
$\hat{\Delta}$ are replaced by 0.97 to give a new matrix $\tilde{\Delta}$, and that $\hat{A}$ is
replaced by $\tilde{A} = \hat{B} \tilde{\Delta} \hat{C}'$. Simulation evidence in Andrews and
Monahan (1992) and Newey and West (1994) suggests the use of prewhitening and
recolouring improves the finite sample performance of the asymptotic confidence
intervals in (3.27). So for completeness, we conclude this section by bringing
together all these recommendations into a single procedure. Although this was
originally proposed by Newey and West (1994), we shall give it a more general
name since it represents the synthesis of results and simulation evidence
reported in all the papers cited above.⁵⁶


Estimation of S when f is Stationary and Ergodic 


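1. Estimate the model $f_t = A f_{t-1} + e_t$ by least squares to give $\hat{A}$. Let
$\hat{A} = \hat{B} \hat{\Delta} \hat{C}'$ be the Singular Value Decomposition of $\hat{A}$. Define
$\tilde{\Delta}$ to be the diagonal matrix whose (i,i) element is given by
$\tilde{\Delta}_{ii} = \min\{\hat{\Delta}_{ii}, 0.97\}$ and $\tilde{A} = \hat{B} \tilde{\Delta} \hat{C}'$.
Construct $\tilde{e}_t = f_t - \tilde{A} f_{t-1}$.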


53 Of course, this statement is subject to certain regularity conditions being satisfied; see 
den Haan and Levin (1996). 

54 See Hamilton (1994) [p.259]. 

55 This decomposition can be calculated straightforwardly in most computer packages for
matrix analysis. It is defined as follows. First, note that $\hat{A}'\hat{A}$ and $\hat{A}\hat{A}'$ have
exactly the same set of nonzero eigenvalues, which we denote by
$\{\hat{\delta}_i^2;\ i = 1, 2, \ldots, r\}$. It is reasonable to assume in our context that r = q
and so both $\hat{A}'\hat{A}$ and $\hat{A}\hat{A}'$ are of full rank. The i-th diagonal element of
$\hat{\Delta}$ is $\hat{\delta}_i$, the non-negative square root of $\hat{\delta}_i^2$. The matrix $\hat{B}$ is
the (q × q) matrix whose i-th column is the eigenvector of $\hat{A}\hat{A}'$ associated with
$\hat{\delta}_i^2$. The matrix $\hat{C}$ is the (q × q) matrix whose i-th column is the eigenvector
of $\hat{A}'\hat{A}$ associated with $\hat{\delta}_i^2$. For example, see Dhrymes (1984) [p.78] or Strang
(1988) [Appendix A] for a more detailed discussion.

56 Also see Section 4.3. 
2. Use an HAC estimator in conjunction with Newey and West’s method of
bandwidth selection given above to construct the matrix

$$\hat{\Sigma} = \hat{\Gamma}_0 + \sum_{i=1}^{T-1} \omega_{i,T} (\hat{\Gamma}_i + \hat{\Gamma}_i')$$

where $\hat{\Gamma}_i = T^{-1} \sum_{t=i+2}^{T} \tilde{e}_t \tilde{e}_{t-i}'$.

3. The estimator of S is

$$\hat{S}_{SE} = [I_q - \tilde{A}]^{-1} \hat{\Sigma} \{ [I_q - \tilde{A}]^{-1} \}' \tag{3.58}$$

where the subscript “SE” stands for “stationary and ergodic”.
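
The three steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation used for the results in this chapter: the function name `s_se` is hypothetical, the bandwidth is passed in by the user rather than chosen by Newey and West’s data-based method, and the Bartlett weights are taken to be $\omega_{i,T} = 1 - i/(b_T + 1)$.

```python
import numpy as np

def s_se(f, bandwidth, cap=0.97):
    """Prewhitened, recoloured long run variance estimate for a (T x q) array f.

    Assumptions of this sketch: VAR(1) prewhitening, Bartlett weights
    1 - i/(bandwidth + 1), and a user-supplied integer bandwidth.
    """
    f = np.asarray(f, dtype=float)
    T, q = f.shape
    y, x = f[1:], f[:-1]

    # Step 1: least squares fit of f_t = A f_{t-1} + e_t.
    A_hat = np.linalg.solve(x.T @ x, x.T @ y).T
    # Singular value decomposition A_hat = B diag(d) C'; cap the singular values.
    B, d, C_t = np.linalg.svd(A_hat)
    A_til = B @ np.diag(np.minimum(d, cap)) @ C_t
    e = y - x @ A_til.T                      # filtered residuals

    # Step 2: HAC estimator applied to the filtered residuals.
    sigma = e.T @ e / T
    for i in range(1, int(bandwidth) + 1):
        w = 1.0 - i / (bandwidth + 1.0)      # Bartlett weight (assumed form)
        gamma_i = e[i:].T @ e[:-i] / T       # i-th sample autocovariance
        sigma += w * (gamma_i + gamma_i.T)

    # Step 3: recolour with the modified filter.
    D = np.linalg.inv(np.eye(q) - A_til)
    return D @ sigma @ D.T
```

With `bandwidth=0` only $\hat{\Gamma}_0$ enters step 2, so the estimator reduces to recolouring the residual variance from the modified VAR(1) filter.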


The choice between the covariance matrix estimators $\hat{S}_{SU}$, $\hat{S}_{VARMA}$ and
$\hat{S}_{SE}$ depends on the model in question. Whichever estimator is appropriate, it
can be used to calculate the approximate large sample confidence intervals for
$\theta_0$ given in (3.27). This section concludes with an illustration of the various
methods in the context of Hansen and Singleton’s (1982) consumption based
asset pricing model.


Example: Hansen and Singleton’s (1982) Consumption Based Asset 
Pricing Model 

Since $z_t \in \Omega_t$, it follows from (1.22) that
$f(v_t, \theta_0) = z_t(\delta_0 x_{1,t+1}^{\gamma_0 - 1} x_{2,t+1} - 1)$ is a
martingale difference sequence. Therefore the economic model implies S can be
consistently estimated by $\hat{S}_{SU}$ given in (3.40). In spite of this structure, we
shall use this example to illustrate all the various methods discussed in this
section. For den Haan and Levin’s (1996) method K is set equal to
$\text{int}\{T^{1/3}\} = 7$ but in each case the Schwarz criterion chooses k = 0 and so
indicates that $f_t$ is serially uncorrelated. In this case, $\hat{S}_{VARMA}$ equals
$\hat{S}_{SU}$. Three versions of $\hat{S}_{HAC}$ are calculated: one for each kernel in Table
3.3. In each case, we fix the bandwidth to $b_T = 7$. Finally, three versions of
$\hat{S}_{SE}$ are calculated; again one for each kernel. The bandwidth for each is
calculated using the parameters in Table 3.4, and so n equals
$\text{int}\{T^{2/9}\} = 3$, $\text{int}\{T^{4/25}\} = 2$, $\text{int}\{T^{2/25}\} = 1$ respectively
for the Bartlett, Parzen and Quadratic Spectral kernel. We arbitrarily chose to
set $h = (1,1,1,1,1)'$. Clearly the width of the confidence intervals is
determined by the standard error of the estimates,

$$\text{s.e.}(\hat{\theta}_{T,i}) = \sqrt{\hat{V}_{T,ii}/T} \tag{3.59}$$

where $\hat{V}_{T,ii}$ is the i-th main diagonal element of $\hat{V}_T = \hat{M}_T \hat{S}_T \hat{M}_T'$ and
$\hat{M}_T = [G_T(\hat{\theta}_T)' W_T G_T(\hat{\theta}_T)]^{-1} G_T(\hat{\theta}_T)' W_T$. So, for brevity, only
the standard errors of $\hat{\gamma}_T$ and $\hat{\delta}_T$ are reported. Table 3.5 contains these
statistics for the case in which the model is estimated with equally weighted
returns (EWR), and Table 3.6 presents the results for the case in which the
model is estimated with value weighted returns (VWR).⁵⁷
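
The standard error calculation just described can be sketched as follows. The function name `gmm_standard_errors` and its argument layout are hypothetical; the inputs are assumed to be the derivative matrix $G_T(\hat{\theta}_T)$ (q × p), the weighting matrix $W_T$ (q × q), a covariance matrix estimate $\hat{S}_T$ (q × q) and the sample size T.

```python
import numpy as np

def gmm_standard_errors(G, W, S, T):
    """Standard errors sqrt(V_ii / T) with V = M S M' and M = (G'WG)^{-1} G'W."""
    G, W, S = (np.asarray(a, dtype=float) for a in (G, W, S))
    M = np.linalg.solve(G.T @ W @ G, G.T @ W)   # M = (G'WG)^{-1} G'W
    V = M @ S @ M.T                             # sandwich form of the variance
    return np.sqrt(np.diag(V) / T)
```

Swapping one estimate of S for another while holding G and W fixed reproduces the kind of comparison reported in Tables 3.5 and 3.6.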


Certain features stand out. First, the choice of covariance matrix estimator
has some impact on the calculated standard errors. In principle, all the
versions of $\hat{S}_T$ are consistent if the model is correctly specified because then
$f(v_t, \theta_0)$ is a martingale difference sequence. Since den Haan and Levin’s
(1996) method confirms the absence of serial correlation in $f(v_t, \theta_0)$, these
differences reflect inherent randomness or finite sample bias. Secondly, the
estimates based on $W_T = (T^{-1} \sum_{t=1}^{T} z_t z_t')^{-1}$ give much smaller standard
errors. Finally, no matter what the choice of asset or weighting matrix,
$\delta_0$ is far more precisely estimated than $\gamma_0$.


Table 3.5
Standard errors of the first step estimators for the
consumption-based asset pricing model with EWR

$W_T$                                   $\hat{S}_T$     s.e.($\hat{\gamma}_T$)   s.e.($\hat{\delta}_T$)
$10^5 I_5$                              SU, VARMA    6.844        $1.210 \times 10^{-2}$
                                        HAC(B,7)     5.893        $1.036 \times 10^{-2}$
                                        HAC(P,7)     6.458        $1.132 \times 10^{-2}$
                                        HAC(Q,7)     5.549        $9.720 \times 10^{-3}$
                                        SE(B,1)      7.670        $1.360 \times 10^{-2}$
                                        SE(P,4)      7.148        $1.254 \times 10^{-2}$
                                        SE(Q,2.2)    7.340        $1.293 \times 10^{-2}$
$(T^{-1} \sum_{t=1}^{T} z_t z_t')^{-1}$   SU, VARMA    2.263        $4.393 \times 10^{-3}$
                                        HAC(B,7)     2.134        $4.544 \times 10^{-3}$
                                        HAC(P,7)     2.148        $4.540 \times 10^{-3}$
                                        HAC(Q,7)     2.091        $4.502 \times 10^{-3}$
                                        SE(B,0)      2.430        $4.916 \times 10^{-3}$
                                        SE(P,1)      2.420        $4.894 \times 10^{-3}$
                                        SE(Q,2.49)   2.308        $4.726 \times 10^{-3}$

Notes: B, P, Q denote the Bartlett, Parzen, Quadratic Spectral kernel. For
K = B, P or Q: HAC(K,7) denotes an HAC estimator with K kernel
and $b_T = 7$; SE(K,b) denotes $\hat{S}_{SE}$ with K kernel and estimated bandwidth b.


57 It should be noted that evidence reported below indicates the model is misspecified for
EWR and this renders the standard errors in (3.59) invalid. See Section 5.1.4 and Section 4.2
respectively.
Table 3.6
Standard errors of the first step estimators for the
consumption-based asset pricing model with VWR

$W_T$                                   $\hat{S}_T$     s.e.($\hat{\gamma}_T$)   s.e.($\hat{\delta}_T$)
$10^5 I_5$                              SU, VARMA    5.840        $1.063 \times 10^{-2}$
                                        HAC(B,7)     4.559        $8.447 \times 10^{-3}$
                                        HAC(P,7)     4.827        $8.852 \times 10^{-3}$
                                        HAC(Q,7)     4.342        $8.059 \times 10^{-3}$
                                        SE(B,1)      5.593        $1.032 \times 10^{-2}$
                                        SE(P,5)      5.073        $9.315 \times 10^{-3}$
                                        SE(Q,0.78)   5.632        $1.038 \times 10^{-2}$
$(T^{-1} \sum_{t=1}^{T} z_t z_t')^{-1}$   SU, VARMA    1.867        $3.761 \times 10^{-3}$
                                        HAC(B,7)     1.699        $3.523 \times 10^{-3}$
                                        HAC(P,7)     1.722        $3.548 \times 10^{-3}$
                                        HAC(Q,7)     1.674        $3.489 \times 10^{-3}$
                                        SE(B,0)      1.850        $3.765 \times 10^{-3}$
                                        SE(P,4)      1.761        $3.626 \times 10^{-3}$
                                        SE(Q,1.62)   1.812        $3.727 \times 10^{-3}$

Notes: See Table 3.5 for definitions.


3.6 The Optimal Choice of Weighting Matrix 


In Section 3.3 it is shown that if q = p then GMM is equivalent to the Method
of Moments estimator based on $E[f(v_t, \theta_0)] = 0$ and so does not depend on the
weighting matrix. However if q > p then no such reduction is possible and it
is clear from Theorem 3.2 that the asymptotic variance of $\hat{\theta}_T$ depends on $W_T$
via W.⁵⁸ This opens up the possibility that inferences may be sensitive to W.
Just as in the linear model, it is desirable to base inference on the most precise
estimator and so the optimal choice of W is the one which yields the minimum
variance in a matrix sense. Once again, this choice is $S^{-1}$; however this time
we state the result more formally. Hansen (1982) proves this result but we note
parenthetically that his argument is different from the one employed below.


Theorem 3.4 Optimal Choice of Weighting Matrix
If Assumptions 3.1-3.5, 3.7-3.13 hold then the minimum asymptotic variance
of $\hat{\theta}_T$ is $(G_0' S^{-1} G_0)^{-1}$ and this can be obtained by setting $W = S^{-1}$.


Note that the regularity conditions are imposed to ensure that $\hat{\theta}_T$ has the
asymptotic distribution given in Theorem 3.2.


Proof of Theorem 3.4:
Let $\hat{\theta}_T(W)$ be the GMM estimator based on Assumption 3.3 with weighting
matrix $W_T$. It can be recalled from Section 2.4 that the result is established if it

58 If p = q then the asymptotic variance of $\hat{\theta}_T$ is $MSM' = (G_0' S^{-1} G_0)^{-1}$.
can be shown that $V(W) - V(S^{-1})$ equals a positive semi-definite matrix, where
$V(W)$ denotes the variance of the limiting distribution of $T^{1/2}[\hat{\theta}_T(W) - \theta_0]$.

To begin the proof, it is useful to relate $T^{1/2}[\hat{\theta}_T(W) - \theta_0]$ to
$T^{1/2}[\hat{\theta}_T(S^{-1}) - \theta_0]$. This is done quite simply by noting that

$$T^{1/2}[\hat{\theta}_T(W) - \theta_0] = T^{1/2}[\hat{\theta}_T(S^{-1}) - \theta_0] + T^{1/2}[\hat{\theta}_T(W) - \hat{\theta}_T(S^{-1})] \tag{3.60}$$

Now, from (3.33) it follows that

$$T^{1/2}[\hat{\theta}_T(W) - \theta_0] = -M(W) T^{1/2} g_T(\theta_0) + o_p(1) \tag{3.61}$$

where $M(W) = (G_0' W G_0)^{-1} G_0' W$, and so that

$$T^{1/2}[\hat{\theta}_T(W) - \hat{\theta}_T(S^{-1})] = -[M(W) - M(S^{-1})] T^{1/2} g_T(\theta_0) + o_p(1) \tag{3.62}$$

Therefore, if we substitute (3.61) and (3.62) into (3.60) and calculate the
limiting variance of each side then it follows that

$$V(W) = V(S^{-1}) + V_1 + C + C' \tag{3.63}$$

where $V_1 = \lim_{T \to \infty} Var[\{M(W) - M(S^{-1})\} T^{1/2} g_T(\theta_0)]$ and

$$C = \lim_{T \to \infty} Cov\left[ \{M(W) - M(S^{-1})\} T^{1/2} g_T(\theta_0),\ M(S^{-1}) T^{1/2} g_T(\theta_0) \right] \tag{3.64}$$

Equation (3.63) is easily rearranged to give

$$V(W) - V(S^{-1}) = V_1 + C + C' \tag{3.65}$$

Now $V_1$ is positive semi-definite by construction, and so we focus attention on
C. By definition, it follows that

$$\begin{aligned}
C &= \lim_{T \to \infty} E\left[ \{M(W) - M(S^{-1})\} T^{1/2} g_T(\theta_0)\, T^{1/2} g_T(\theta_0)' M(S^{-1})' \right] \\
  &= \{M(W) - M(S^{-1})\} \lim_{T \to \infty} E\left[ T^{1/2} g_T(\theta_0)\, T^{1/2} g_T(\theta_0)' \right] M(S^{-1})' \\
  &= M(W) S M(S^{-1})' - M(S^{-1}) S M(S^{-1})' = 0
\end{aligned} \tag{3.66}$$

Equations (3.65)-(3.66) and the definition of $V_1$ establish the desired result.
⋄


The proof is derived by showing that C = 0. It can be recognized that C
is the asymptotic covariance between $T^{1/2}[\hat{\theta}_T(S^{-1}) - \theta_0]$ and
$T^{1/2}[\hat{\theta}_T(W) - \hat{\theta}_T(S^{-1})]$. Therefore, C = 0 implies that
$T^{1/2}[\hat{\theta}_T(S^{-1}) - \theta_0]$ is asymptotically uncorrelated with
$T^{1/2}[\hat{\theta}_T(W) - \hat{\theta}_T(S^{-1})]$ for any W.
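
The algebra in (3.63)-(3.66) can be checked numerically. The sketch below is purely illustrative: $G_0$, S and W are arbitrary randomly generated matrices (not estimates from any data set), and the helper names `m_matrix` and `asy_var` are hypothetical. It confirms that $V(W) - V(S^{-1})$ has no negative eigenvalues and that the cross term C vanishes.

```python
import numpy as np

def m_matrix(G, W):
    """M(W) = (G'WG)^{-1} G'W."""
    return np.linalg.solve(G.T @ W @ G, G.T @ W)

def asy_var(G, W, S):
    """V(W) = M(W) S M(W)'."""
    M = m_matrix(G, W)
    return M @ S @ M.T

rng = np.random.default_rng(1)
q, p = 5, 2
G0 = rng.standard_normal((q, p))           # arbitrary full column rank matrix
L = rng.standard_normal((q, q))
S = L @ L.T + np.eye(q)                    # arbitrary positive definite S
K = rng.standard_normal((q, q))
W = K @ K.T + np.eye(q)                    # arbitrary positive definite W

S_inv = np.linalg.inv(S)
V_W = asy_var(G0, W, S)
V_opt = asy_var(G0, S_inv, S)              # equals (G0' S^{-1} G0)^{-1}
gap = V_W - V_opt
min_eig = np.linalg.eigvalsh(gap).min()    # non-negative up to rounding error
C = (m_matrix(G0, W) - m_matrix(G0, S_inv)) @ S @ m_matrix(G0, S_inv).T
```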

Theorem 3.4 implies the optimal choice of $W_T$ is $\hat{S}_T^{-1}$ where $\hat{S}_T$ is a
consistent estimator of S. As in the linear model, the construction of this
estimator requires at least two steps. On the first step a sub-optimal choice of
$W_T$ is used to obtain a preliminary estimator, $\hat{\theta}_T(1)$. This estimator is used to
obtain a consistent estimator of S, which is denoted $\hat{S}_T(1)$. On the second step
$\theta_0$ is re-estimated with $W_T = \hat{S}_T(1)^{-1}$. The resulting estimator, $\hat{\theta}_T(2)$, has
the minimum asymptotic covariance matrix given in Theorem 3.4.⁵⁹ However, this
two step estimator is based on a version of the optimal weighting matrix
constructed using a sub-optimal estimator of $\theta_0$. This suggests there may be
finite sample gains from using $\hat{\theta}_T(2)$ to construct a new estimator of S,
$\hat{S}_T(2)$ say, and then re-estimating $\theta_0$ with $W_T = \hat{S}_T(2)^{-1}$. The resulting
estimator, $\hat{\theta}_T(3)$, has the same asymptotic distribution as $\hat{\theta}_T(2)$ but it is
anticipated to be more efficient in finite samples. This potential finite sample
gain in efficiency provides a justification for updating the estimate of S again
and re-estimating $\theta_0$. This process can be continued iteratively until the
estimates converge; if this is done then it yields what has become known as the
iterated GMM estimator. The i-th step of such an iterative procedure is as
follows.


The i-th Step of Iterated GMM Estimation


• If i = 1: Estimate $\theta_0$ using GMM based on the population moment
condition in Assumption 3.3 with a sub-optimal weighting matrix, such as
$W_T = I_q$. Denote this estimator by $\hat{\theta}_T(1)$. Use this estimator to construct
a consistent estimator of S by one of the methods described in Section
3.5.⁶⁰ Denote this estimator by $\hat{S}_T(1)$.


• If i > 1: Estimate $\theta_0$ using GMM based on the population moment
condition in Assumption 3.3 with $W_T = \hat{S}_T(i-1)^{-1}$ where $\hat{S}_T(i-1)$ is a
consistent estimator of S based on $\hat{\theta}_T(i-1)$, the estimator of $\theta_0$ from the
(i-1)-th step. If $\|\hat{\theta}_T(i) - \hat{\theta}_T(i-1)\| < \epsilon_\theta$ then the procedure has
converged and the iterated GMM estimator is $\hat{\theta}_T = \hat{\theta}_T(i)$. If
$\|\hat{\theta}_T(i) - \hat{\theta}_T(i-1)\| \geq \epsilon_\theta$ and $i < I_{max}$ then go to the (i+1)-th step.


Typically $\epsilon_\theta$ is set equal to some small positive number such as $10^{-8}$. Notice
that a ceiling of $I_{max}$ has been placed on the number of steps. This is needed
because in practice there is no guarantee that this iterative procedure converges
and so limiting the number of steps is a safeguard against putting the
computer into an infinite loop! Regardless of whether convergence occurs before the
chosen $I_{max}$, all $\{\hat{\theta}_T(i);\ i > 1\}$ have the same asymptotic distribution with the
covariance matrix given in Theorem 3.4.
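
The iteration can be sketched for a case in which each GMM step has a closed form. The sketch below assumes, purely for illustration, a linear model $y_t = x_t'\theta_0 + u_t$ with instrument vector $z_t$, so that $f(v_t, \theta) = z_t(y_t - x_t'\theta)$, and it estimates S by the sample second moment of $f$, which is appropriate only when $f(v_t, \theta_0)$ is serially uncorrelated. The function names and the default values of `eps` and `i_max` are hypothetical choices.

```python
import numpy as np

def linear_gmm(y, X, Z, W):
    """Minimizer of g'Wg with g = Z'(y - X theta)/T, in closed form."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))

def s_su(y, X, Z, theta):
    """Sample second moment of f_t = z_t u_t(theta) (serially uncorrelated case)."""
    u = y - X @ theta
    f = Z * u[:, None]
    return f.T @ f / len(y)

def iterated_gmm(y, X, Z, eps=1e-8, i_max=100):
    """Return (estimate, number of steps used, convergence flag)."""
    theta = linear_gmm(y, X, Z, np.eye(Z.shape[1]))        # step i = 1, W_T = I_q
    for i in range(2, i_max + 1):
        W = np.linalg.inv(s_su(y, X, Z, theta))            # W_T = S_T(i-1)^{-1}
        theta_new = linear_gmm(y, X, Z, W)
        converged = np.linalg.norm(theta_new - theta) < eps
        theta = theta_new
        if converged:
            return theta, i, True
    return theta, i_max, False                             # ceiling reached
```

Returning the convergence flag alongside the estimate makes it explicit when the ceiling on the number of steps, rather than the tolerance, ended the iteration.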

The choice of $W = S^{-1}$ has a second important implication for the asymptotic
behaviour of the estimator which is presented in the following theorem.


Theorem 3.5 Asymptotic Independence of $T^{1/2}(\hat{\theta}_T - \theta_0)$ and
$S^{-1/2} T^{1/2} g_T(\hat{\theta}_T)$
If (i) Assumptions 3.1-3.5, and 3.7-3.13 hold; (ii) $W = S^{-1}$; then
$T^{1/2}(\hat{\theta}_T - \theta_0)$ and $S^{-1/2} T^{1/2} g_T(\hat{\theta}_T)$ are asymptotically independent.

59 This estimator is sometimes referred to as Hansen’s two step estimator because it is
proposed in Hansen (1982).

60 Also see Section 4.3.
Proof:

First recall that Theorems 3.2 and 3.3 establish that both statistics converge to
normal distributions, and so a necessary and sufficient condition for asymptotic
independence is that these two statistics are asymptotically uncorrelated. The
latter can be deduced from (3.26) and (3.36). Using Lemma 3.3 and putting
$W = S^{-1}$, it follows from (3.26) and (3.36) that

$$T^{1/2}(\hat{\theta}_T - \theta_0) = H_{1,T} + o_p(1) \tag{3.67}$$

$$W_T^{1/2} T^{1/2} g_T(\hat{\theta}_T) = H_{2,T} + o_p(1) \tag{3.68}$$

where

$$H_{1,T} = -[F(\theta_0)' F(\theta_0)]^{-1} F(\theta_0)' S^{-1/2} T^{1/2} g_T(\theta_0)$$

$$H_{2,T} = [I_q - P(\theta_0)] S^{-1/2} T^{1/2} g_T(\theta_0)$$

If we let $C = \lim_{T \to \infty} Cov[H_{1,T}, H_{2,T}]$ then it follows from Theorems 3.2 and
3.3 that

$$C = \lim_{T \to \infty} E[H_{1,T} H_{2,T}'] \tag{3.69}$$

Using (3.67) and (3.68) in (3.69), we obtain

$$\begin{aligned}
C &= \lim_{T \to \infty} E\left[ -[F(\theta_0)' F(\theta_0)]^{-1} F(\theta_0)' S^{-1/2} T^{1/2} g_T(\theta_0)\, T^{1/2} g_T(\theta_0)' S^{-1/2\prime} [I_q - P(\theta_0)] \right] \\
  &= -[F(\theta_0)' F(\theta_0)]^{-1} F(\theta_0)' S^{-1/2} \left\{ \lim_{T \to \infty} Var[T^{1/2} g_T(\theta_0)] \right\} S^{-1/2\prime} [I_q - P(\theta_0)] \\
  &= -[F(\theta_0)' F(\theta_0)]^{-1} F(\theta_0)' S^{-1/2} S S^{-1/2\prime} [I_q - P(\theta_0)]
\end{aligned}$$

Now, by definition, we have $S = S^{1/2\prime} S^{1/2}$ and $S^{-1} = S^{-1/2\prime} S^{-1/2}$ which
together imply $S^{-1/2} = (S^{1/2\prime})^{-1}$. It therefore follows that
$S^{-1/2} S S^{-1/2\prime} = I_q$. Using this identity C reduces to

$$C = -[F(\theta_0)' F(\theta_0)]^{-1} F(\theta_0)' [I_q - P(\theta_0)] = 0$$

⋄


This independence property is exploited in the construction of certain test
statistics described in Chapter 5. However, in our present context, it provides an
interesting perspective on why this choice of W leads to an efficient estimator.
First, notice that if we repeat the sequence of steps in the proof of Theorem 3.5
with any other choice of W then the end result is that $C \neq 0$. Therefore,
$W = S^{-1}$ is the only choice of weighting matrix for which the estimator is
statistically independent of the part of the moment condition unused in
estimation. In other words, by making this choice of W, we have extracted all
possible information about the parameters contained in the sample moment.
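
The matrix identities used in the proof of Theorem 3.5 can also be verified numerically. The sketch below is illustrative only and rests on two assumptions that are not restated in this section: that $F(\theta_0) = S^{-1/2} G_0$ when $W = S^{-1}$, and that $P(\theta_0) = F(\theta_0)[F(\theta_0)'F(\theta_0)]^{-1}F(\theta_0)'$. The matrices are arbitrary random draws, and $S^{1/2}$ is taken to be the transposed Cholesky factor so that $S = S^{1/2\prime} S^{1/2}$.

```python
import numpy as np

rng = np.random.default_rng(2)
q, p = 4, 2
G0 = rng.standard_normal((q, p))             # arbitrary full column rank matrix
L = rng.standard_normal((q, q))
S = L @ L.T + np.eye(q)                      # arbitrary positive definite S

S_half = np.linalg.cholesky(S).T             # upper triangular, S = S_half' S_half
S_mhalf = np.linalg.inv(S_half.T)            # S^{-1/2} = (S^{1/2}')^{-1}
identity_check = S_mhalf @ S @ S_mhalf.T     # should equal I_q

F = S_mhalf @ G0                             # assumed definition of F(theta_0)
P = F @ np.linalg.solve(F.T @ F, F.T)        # assumed projection matrix P(theta_0)
C = -np.linalg.solve(F.T @ F, F.T) @ identity_check @ (np.eye(q) - P)
```

Here `C` is zero up to rounding error because $F(\theta_0)'[I_q - P(\theta_0)] = 0$.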

The estimators described in this section are often referred to as “the optimal
two step GMM” or “optimal iterated GMM” estimator. It is important to
realize that this optimality only refers to the choice of weighting matrix. These
are the most precise GMM estimators which can be constructed from the given
population moment condition $E[f(v_t, \theta_0)] = 0$. It does not imply that there is
anything optimal about the population moment condition itself. The optimal
choice of moment condition is discussed in Chapter 7. We conclude this section
with an empirical illustration of the two-step and iterated estimator.


Example: Hansen and Singleton’s (1982) Consumption Based Asset 
Pricing Model 

Table 3.7 contains the two step and iterated GMM estimation results for both
equally weighted returns (EWR) and value weighted returns (VWR). Since the
economic model implies $f(v_t, \theta_0)$ is a martingale difference sequence, the
covariance matrix is estimated by $\hat{S}_{SU}$ at each step. The convergence criterion
for the GMM iterative procedure is $\epsilon_\theta = 10^{-6}$. Convergence took only four
iterations with VWR and five with EWR. After two steps the impact of the
first-step weighting matrix is clearly diminishing. With iteration, the impact
disappears completely.


Table 3.7
Two step and iterated GMM estimators for the consumption
based asset pricing model with EWR and VWR

EWR:
$W_T^{(1)}$   $(\hat{\gamma}_T, \hat{\delta}_T)$ for i = 1   $(\hat{\gamma}_T, \hat{\delta}_T)$ for i = 2   $(\hat{\gamma}_T, \hat{\delta}_T)$ after iteration
A         (-3.145, 0.999)          (-0.328, 0.999)          (-0.343, 0.992)
B         (0.398, 0.993)           (-0.317, 0.992)          (-0.343, 0.992)

VWR:
$W_T^{(1)}$   $(\hat{\gamma}_T, \hat{\delta}_T)$ for i = 1   $(\hat{\gamma}_T, \hat{\delta}_T)$ for i = 2   $(\hat{\gamma}_T, \hat{\delta}_T)$ after iteration
A         (-1.871, 0.998)          (0.706, 0.994)           (0.666, 0.994)
B         (0.698, 0.994)           (0.666, 0.994)           (0.666, 0.994)

Notes: $W_T^{(1)}$ denotes the first-step weighting matrix, A denotes $W_T^{(1)} = 10^5 I_5$ and B
denotes $W_T^{(1)} = (T^{-1} \sum_{t=1}^{T} z_t z_t')^{-1}$.


Table 3.8 reports the standard errors and 95% confidence intervals for the
parameters. A comparison with the first step standard errors in Tables 3.5 and
3.6 indicates that iteration has increased the precision. As before, the discount
factor is very precisely estimated, but the coefficient of relative risk aversion is
not. In fact, the confidence intervals for $\gamma_0$ include values which exceed one. It
may be recalled from Section 1.3.1 that $\gamma_0 < 1$ was a necessary restriction for
the representative agent to possess a concave utility function. However, this is
not necessarily a concern since the confidence intervals are also consistent with
the representative agent’s utility function being concave.


Table 3.8
Approximate standard errors and 95% confidence intervals for the
iterated GMM Estimators in the consumption based asset pricing model

Asset   s.e.($\hat{\gamma}_T$)   c.i.($\hat{\gamma}_T$)        s.e.($\hat{\delta}_T$)   c.i.($\hat{\delta}_T$)
EWR     2.215      (-4.863, 4.000)   0.004      (0.983, 1.000)
VWR     1.823      (-2.916, 4.249)   0.004      (0.987, 1.001)

Notes: s.e.(.) denotes the standard error calculated using (3.59) with $W_T = \hat{S}_T^{-1}$,
$\hat{S}_T = \hat{S}_{SU}$, and $\hat{S}_{SU}$ is defined in (3.40). c.i.(.) denotes the 95% confidence
interval calculated using (3.27).


The imprecision of the estimates is a concern, however. The source of the
problem can be traced to an interaction of the properties of the data and the
nature of the nonlinearity in $f(v_t, \theta)$. The mean and standard deviation of real
per capita consumption growth, $x_{1,t+1}$, are 1.002 and 0.004 respectively. The
mean and standard deviation of the asset series, $x_{2,t+1}$, are 1.008 and 0.050
respectively for EWR and 1.006 and 0.042 for VWR. So clearly all the series
fluctuate approximately around one; most importantly consumption growth
deviates very little from this value. The nonlinearity enters through the Euler
equation residual

$$u_t(\theta) = \delta x_{1,t+1}^{\gamma - 1} x_{2,t+1} - 1 \tag{3.70}$$

Now, if we replace $x_{1,t+1}$ and $x_{2,t+1}$ in (3.70) by their approximate means of
one then we have

$$u_t(\theta) \approx \delta 1^{\gamma - 1} - 1$$

This approximation can be set to zero by putting $\delta = 1$ regardless of the value of
$\gamma$. Of course, the data exhibit some variation so that the approximation does not
hold exactly. However it is close enough to give the flavour of the problem here:
the population moment condition provides very good information about $\delta_0$ but
poor information about $\gamma_0$. This is an example of the case in which a parameter
is weakly identified by the population moment condition. This situation occurs
sufficiently frequently to have generated its own branch of GMM theory, and
this is reviewed in Section 8.2.
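
The flavour of this argument can be reproduced with simulated data. The sketch below is an illustration under loudly stated assumptions: the two series are independent normal draws whose means and standard deviations match the EWR figures quoted above, the sample size of 500 and the evaluation point $(\gamma, \delta) = (0.5, 0.99)$ are arbitrary, and no actual consumption or return data are used. It compares how strongly the sample mean of the residual in (3.70) responds to $\delta$ and to $\gamma$.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500                                    # hypothetical sample size
x1 = rng.normal(1.002, 0.004, T)           # simulated consumption growth
x2 = rng.normal(1.008, 0.050, T)           # simulated gross asset return

def mean_residual(gamma, delta):
    """Sample mean of u_t(theta) = delta * x1^(gamma - 1) * x2 - 1."""
    return np.mean(delta * x1 ** (gamma - 1.0) * x2 - 1.0)

h = 1e-4                                   # step for central differences
gamma, delta = 0.5, 0.99                   # arbitrary evaluation point
d_delta = (mean_residual(gamma, delta + h) - mean_residual(gamma, delta - h)) / (2 * h)
d_gamma = (mean_residual(gamma + h, delta) - mean_residual(gamma - h, delta)) / (2 * h)
sensitivity_ratio = abs(d_delta) / abs(d_gamma)
```

In this simulated setting the derivative with respect to $\delta$ is of order one, while the derivative with respect to $\gamma$ is of the order of the mean of $\log x_{1,t+1}$, which is why the sample moment says so little about $\gamma_0$.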

Although we return to this model to illustrate other aspects of the GMM
framework, this nevertheless seems the most appropriate place to mention briefly
subsequent developments in the empirical literature on this topic. Since Hansen
and Singleton’s (1982) study there have been a number of papers which have
estimated the consumption based asset pricing model with more sophisticated
utility functions; see Kocherlakota (1996) for a survey. However, empirical
success has been limited. Many studies encounter the same problem as we did
above: aggregate consumption data exhibit far less variation than asset
returns and so cannot possibly explain how these assets are priced. This could
mean the economic model is fundamentally wrong or that we have the wrong
measure of consumption. The latter explanation has recently received some
attention. Mankiw and Zeldes (1991) document that stocks are owned by only
approximately thirty percent of the U.S. population and therefore aggregate
consumption is unlikely to be a good proxy for the consumption of asset holders.
Unfortunately, aggregate data for stockholders are unavailable. Hagiwara and
Herce (1997) circumvent this problem by using aggregate dividends to proxy
the consumption of asset holders and find this substitution leads to far more
reasonable empirical results. ⋄


3.7 Transformations, Normalizations and the 
Continuous Updating GMM Estimator 


So far in this chapter, we have treated the data and parameter vector as given. 
However, in practice, a researcher may have to make decisions about the scale 
of the data or the parameterization of the model or whether to transform f(.) 
in some fashion. In this section, we consider the extent to which the GMM 
estimator is invariant to such decisions. It emerges that the estimator can be 
sensitive to these types of transformations, and this motivates both a variant 
of GMM known as the continuous updating estimator and also an alternative 
method for the calculation of confidence intervals. Both these extensions are 
discussed in this section. 

To begin, it is useful to distinguish five types of transformation which are 
considered below. 


• Units of measurement for $v_t$: In some cases a researcher must decide the
units in which to measure the data. For example, any nominal value can
be measured in dollars, thousands of dollars or millions of dollars. The
choice between them determines whether a price of one thousand dollars is
recorded as 1000, 1 or 0.001, and so determines the scale of the data.


• Reparameterization: Suppose $\theta_0$ is globally identified and $\theta_0 = h(\gamma_0)$
where $h: \mathbb{R}^p \to \mathbb{R}^p$ is a continuous, differentiable bijective mapping.
In this case, the population moment condition can be reparameterized as
$E[f(v_t, h(\gamma_0))] = E[f_\gamma(v_t, \gamma_0)] = 0$, and GMM can be used to estimate $\gamma_0$
based on $E[f_\gamma(v_t, \gamma_0)] = 0$ instead of $\theta_0$ based on $E[f(v_t, \theta_0)] = 0$.


• Normalization of the parameter vector: In some cases, $\theta_0$ may only be
identified up to some scaling factor and so it is necessary to impose some
normalization on $\theta_0$, such as $\theta_{0,i} = 1$, in order to achieve identification.


• Curvature altering transformations of the population moment condition:⁶¹
In some cases, the objective function may be ill-behaved and researchers
have found it advantageous to scale the population moment condition by
some function of $\theta_0$. In other words, estimation of $\theta_0$ is based on
$c(\theta_0) E[f(v_t, \theta_0)] = 0$.


• Stationarity inducing transformations: In some cases, the underlying model
may imply $E[h(\tilde{v}_t, \theta_0)] = 0$ in which $\tilde{v}_t$ is a vector of nonstationary
variables. Such a specification is outside our framework because Assumption
3.1 is violated. However, it may be possible to find a nonsingular
matrix $H(\tilde{v}_{t-1}, \theta_0)$ say, such that $H(\tilde{v}_{t-1}, \theta_0) h(\tilde{v}_t, \theta_0) = f(v_t, \theta_0)$ where $v_t$
is a vector of stationary random variables, and $E[f(v_t, \theta_0)] = 0$. In this
case, GMM estimation can be based on the population moment condition
$E[f(v_t, \theta_0)] = 0$.


Below we consider the impact of each type of transformation on the GMM 
estimator in turn. 


The GMM Estimator and the Units of Measurement for $v_t$

In general, the GMM estimator is not invariant to changes in the units of
measurement of $v_t$. A simple example illustrates. Let $v_t$ be a scalar random
variable with unknown population mean $\theta_0$. This definition implies that,

$$E[v_t] - \theta_0 = 0 \tag{3.71}$$

Since $\theta_0$ is just identified by (3.71), the GMM estimator is just the Method
of Moments estimator which, in turn, is $\hat{\theta}_T = T^{-1} \sum_{t=1}^{T} v_t$. Now suppose $v_t$ is
replaced by $x_t = c v_t$ in (3.71) for some non-zero, finite constant c. The resulting
GMM estimator of $\theta_0$ is $\tilde{\theta}_T = T^{-1} \sum_{t=1}^{T} x_t$. It is easily verified that
$\tilde{\theta}_T = c \hat{\theta}_T$, and so the GMM estimator is not invariant to changes in the scale of
the data. However, this lack of invariance is a strength rather than a weakness
because the scaling of the data has changed the interpretation of the parameter
$\theta_0$. In one case, it is the population mean of $v_t$ and in the other, it is the
population mean of $x_t = c v_t$.

It is important to realize that the lack of invariance applies to scale changes
in $v_t$, that is to the random variables which appear in the population moment
condition. In some cases, $v_t$ may itself be a function of a set of underlying
variables and changes in the units of these variables may or may not have
an impact on the scale of $v_t$. For example in Hansen and Singleton’s (1982)
consumption based asset pricing model, $v_t$ is defined to be $(c_{t+1}/c_t, r_{t+1}/p_t)$. In
this case, since the elements of $v_t$ are ratios, changes in the units of $c_t$ or asset
prices (with commensurate changes in the returns) have no impact on $v_t$, and
hence no impact on the GMM estimator. ⋄


61 This type of transformation is sometimes referred to as “normalization” of the population
moment condition. However, we eschew this terminology to avoid confusion with the concept
of normalization of the parameter vector.
The GMM Estimator and Reparameterization

The GMM estimator is invariant to reparameterization in the sense that the two
parameterizations yield logically consistent estimators. However, a similar result
does not extend to the estimated asymptotic standard errors, and so inferences
may be sensitive to the choice of parameterization. These two statements are
now justified in turn.

Let $Q_{\gamma,T}(\gamma)$ be the GMM minimand associated with the reparameterized
model, that is $Q_{\gamma,T}(\gamma) = Q_T(h(\gamma))$, and $\hat{\gamma}_T = \text{argmin}\, Q_{\gamma,T}(\gamma)$. Given the
properties of h(.) stated above, it is possible to calculate $\hat{\gamma}_T$ as follows. First,
$Q_T(h(\gamma))$ can be minimized with respect to $h(\gamma)$ to yield $\hat{h}_T$, say. Then,
$\hat{h}_T = h(\hat{\gamma}_T)$ can be solved to yield a unique value for $\hat{\gamma}_T$. It is easily recognized
that $\hat{h}_T = \hat{\theta}_T$ and so by construction

$$\hat{\theta}_T = h(\hat{\gamma}_T) \tag{3.72}$$

Therefore the two estimators are logically consistent. However, the same cannot
be said for inferences based on the estimator, as we now show.

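It can be recalled from the discussion following (3.27) that the estimated
asymptotic standard errors of $\hat{\theta}_T$ are the square roots of the diagonal elements
of the matrix,

$$\hat{V}_{\theta,T} = [G_T(\hat{\theta}_T)' W_T G_T(\hat{\theta}_T)]^{-1} G_T(\hat{\theta}_T)' W_T \hat{S}_T W_T G_T(\hat{\theta}_T) [G_T(\hat{\theta}_T)' W_T G_T(\hat{\theta}_T)]^{-1} \tag{3.73}$$

Similar arguments imply that the corresponding matrix for $\hat{\gamma}_T$ is given by

$$\hat{V}_{\gamma,T} = [G_{\gamma,T}(\hat{\gamma}_T)' W_T G_{\gamma,T}(\hat{\gamma}_T)]^{-1} G_{\gamma,T}(\hat{\gamma}_T)' W_T \hat{S}_{\gamma,T} W_T G_{\gamma,T}(\hat{\gamma}_T) [G_{\gamma,T}(\hat{\gamma}_T)' W_T G_{\gamma,T}(\hat{\gamma}_T)]^{-1} \tag{3.74}$$

where $G_{\gamma,T}(.)$ and $\hat{S}_{\gamma,T}$ are the analogs of $G_T(.)$ and $\hat{S}_T$, only defined in terms
of $f_\gamma(.)$ instead of $f(.)$. Intuition suggests that these two matrices should be
related, and they are. To see how, note that (3.72) implies
$f(v_t, \hat{\theta}_T) = f_\gamma(v_t, \hat{\gamma}_T)$, and hence that $\hat{S}_T = \hat{S}_{\gamma,T}$, assuming the same generic
covariance matrix estimator is used in each case, of course. Furthermore, by the
Chain rule

$$\partial f_\gamma(.)/\partial \gamma' = \{\partial f(.)/\partial \theta'\}\, \partial h(.)/\partial \gamma'$$

and so, using (3.72), it follows that

$$G_{\gamma,T}(\hat{\gamma}_T) = G_T(\hat{\theta}_T) H(\hat{\gamma}_T) \tag{3.75}$$

where $H(.) = \partial h(.)/\partial \gamma'$. Collecting these results together and making the
appropriate substitutions into (3.74), it can be shown that

$$\hat{V}_{\gamma,T} = [H(\hat{\gamma}_T)]^{-1} \hat{V}_{\theta,T} \{[H(\hat{\gamma}_T)]^{-1}\}' \tag{3.76}$$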


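To illustrate how reparameterization may affect inferences, it suffices to take
a simple example. Suppose p = 1 and $h(\gamma) = \gamma^3$. The asymptotic confidence
interval for $\gamma_0$ is

$$\hat{\gamma}_T \pm z_{\alpha/2} \sqrt{\hat{V}_{\gamma,T}/T} \tag{3.77}$$

Since $\theta = \gamma^3$ and (3.76) holds with $H(\hat{\gamma}_T) = 3\hat{\gamma}_T^2$, it follows that (3.77) implies
the following interval for $\theta_0$

$$\left( \left\{ \hat{\gamma}_T - z_{\alpha/2} [3\hat{\gamma}_T^2]^{-1} \sqrt{\hat{V}_{\theta,T}/T} \right\}^3,\ \left\{ \hat{\gamma}_T + z_{\alpha/2} [3\hat{\gamma}_T^2]^{-1} \sqrt{\hat{V}_{\theta,T}/T} \right\}^3 \right) \tag{3.78}$$

In contrast, the asymptotic confidence interval based upon $\hat{\theta}_T$ directly is

$$\hat{\theta}_T \pm z_{\alpha/2} \sqrt{\hat{V}_{\theta,T}/T} \tag{3.79}$$

In general, there is no reason why the intervals in (3.78) and (3.79) should be
equal.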
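
A small numerical sketch makes the difference between (3.78) and (3.79) concrete. The function name `intervals` is hypothetical, and the inputs used in the usage note (an estimate of 1.0, a variance estimate of 4.0 and T = 100) are invented purely for illustration; they are not taken from any of the empirical examples in this chapter.

```python
import numpy as np

def intervals(theta_hat, V_theta, T, z=1.96):
    """Return the interval for theta_0 implied by (3.78) and the direct one in (3.79)."""
    gamma_hat = np.cbrt(theta_hat)                 # solves theta_hat = gamma_hat^3
    H = 3.0 * gamma_hat ** 2                       # H(gamma) = dh/dgamma for h = gamma^3
    V_gamma = V_theta / H ** 2                     # (3.76) with p = 1
    half_gamma = z * np.sqrt(V_gamma / T)
    via_gamma = ((gamma_hat - half_gamma) ** 3,    # cube the endpoints of (3.77)
                 (gamma_hat + half_gamma) ** 3)
    half_theta = z * np.sqrt(V_theta / T)
    direct = (theta_hat - half_theta, theta_hat + half_theta)
    return via_gamma, direct

via_gamma, direct = intervals(theta_hat=1.0, V_theta=4.0, T=100)
```

For these hypothetical inputs the interval obtained via $\gamma$ is not symmetric about the point estimate, whereas the direct interval is, so the two cannot coincide.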

This sensitivity is a potential source of concern, and motivates an alternative
method for the construction of confidence intervals that is discussed later in this
section. However, it is worth noting one defence of the intervals described above.
It can be argued that many economic models imply a “natural parameterization”
and so this is the only parameterization of interest. For example, in Hansen and
Singleton’s (1982) consumption based asset pricing model, there are two aspects
of the agent’s behaviour which are crucial for the model: his/her discount factor
and coefficient of relative risk aversion. In our presentation in Section 1.3.1,
these two aspects of the model are captured directly by unknown parameters
$(\delta_0, \gamma_0)$. Alternatively, the model could have been parameterized so that the
discount factor and risk aversion are captured by $h_1(\eta_1)$ and $h_2(\eta_2)$ say, for
some prespecified functions $h_i(.)$ of unknown parameters $(\eta_1, \eta_2)$. However, in
this second approach the unknown parameters have no meaningful economic
interpretation. So the first parameterization is argued to be the “natural” one
for this model and the second, by implication, to be “unnatural”. While this
argument may not find universal favour, it is certainly the case that published
studies tend to employ the natural parameterization. ⋄


The GMM Estimator and Normalization of the Parameter Vector 
In general, the GMM estimators associated with different normalizations of the
parameter vector do not exhibit a logical consistency in finite samples. However,
they do exhibit a logical consistency in the limit.

This particular issue has been the focus of some attention in the literature
on the use of the linear quadratic model for inventory holdings, and this
setting provides a convenient framework for our discussion. Several papers have
contributed to this part of the literature but our discussion is based on Fuhrer,
Moore, and Schuh (1995).⁶² The model has essentially the same structure as
the one described in Section 1.3.4 except that now the cost functions take the
form,


$$C_{q,t} = (\theta_{0,1}/2) Q_t^2 + (\theta_{0,2}/2)(Q_t - Q_{t-1})^2$$
Ci, (90,3/2)(It — wolt-1)? 


62 The interested reader is referred to Fuhrer, Moore, and Schuh (1995) or Blinder and
Maccini (1991) for the appropriate references.
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With these definitions, the Euler equation becomes 


E[9o,1(Qe — oQ) + 0,2(AQe — 200AQi + LAQ) 
+ Oo 3lt + 90,45:|Q] = 0 (3.80) 


where $\beta_0$ and $\Omega_t$ denote the discount factor and information set at time $t$ respectively
(as in Section 1.3.4), $\Delta$ denotes the difference operator,63 and we have
set $\theta_{0,4} = \theta_{0,3}\omega_0$. It is common in the literature on this model to fix the value
for $\beta_0$ a priori because then the Euler equation is linear in both the parameters
and variables. We follow this practice, and so the Euler equation can be written
more compactly as,

$$E[e_t(\theta_0) \mid \Omega_t] = 0 \tag{3.81}$$


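where
$$e_t(\theta) = \theta_1 R_{1t} + \theta_2 R_{2t} + \theta_3 I_t + \theta_4 S_t \tag{3.82}$$


and we have set $R_{1t} = (Q_t - \beta_0 Q_{t+1})$, $R_{2t} = (\Delta Q_t - 2\beta_0 \Delta Q_{t+1} + \beta_0^2 \Delta Q_{t+2})$,
and $\theta_0 = (\theta_{0,1}, \theta_{0,2}, \theta_{0,3}, \theta_{0,4})$. Using a similar argument to (1.23), it follows from
(3.81) that

$$E[z_t e_t(\theta_0)] = 0 \tag{3.83}$$


for any $z_t \in \Omega_t$.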

Ideally (3.83) would form the basis for GMM estimation of $\theta_0$. However,
inspection of (3.82) reveals that $\theta_0$ is not identified by this population moment
condition: if (3.83) holds then so does $E[z_t e_t(\theta)] = 0$ for $\theta = c\theta_0$ and any finite
constant $c$. In other words, $\theta_0$ is only identified up to a scaling factor. In the
absence of any additional information about the parameters from the underlying
economic theory, it is necessary to impose some arbitrary normalization on $\theta_0$
in order to facilitate the estimation. For the purposes of exposition, we consider
two such normalizations. First, suppose the elements of $e_t(\theta_0)$ are divided by
$\theta_{0,1}$ to yield


$$\tilde{e}_t(\psi_0) = R_{1t} + \psi_{0,1} R_{2t} + \psi_{0,2} I_t + \psi_{0,3} S_t \tag{3.84}$$


where $\psi_{0,i} = \theta_{0,i+1}/\theta_{0,1}$. Secondly, suppose the elements of $e_t(\theta_0)$ are divided
by $\theta_{0,4}$ to yield


$$\bar{e}_t(\phi_0) = \phi_{0,1} R_{1t} + \phi_{0,2} R_{2t} + \phi_{0,3} I_t + S_t \tag{3.85}$$


where $\phi_{0,i} = \theta_{0,i}/\theta_{0,4}$. Notice that both these normalizations of $\theta_0$ are logically
consistent in the sense that given $\psi_0$ it is possible to solve uniquely for $\phi_0$ and
vice versa.64

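These normalizations lead to two different population moment conditions
upon which estimation can be based,


$$E[z_t \tilde{e}_t(\psi_0)] = 0 \tag{3.86}$$
$$E[z_t \bar{e}_t(\phi_0)] = 0 \tag{3.87}$$

63 That is $\Delta Q_t = Q_t - Q_{t-1}$.

64 Specifically, the mapping between them is given by $\phi_1 = 1/\psi_3$, $\phi_2 = \psi_1/\psi_3$, $\phi_3 = \psi_2/\psi_3$
where it is assumed for simplicity that all coefficients are non-zero.

Since both population moment conditions have the linear structure considered
in Chapter 2, we can appeal to that earlier analysis to deduce that $\psi_0$ is identified
by (3.86) provided rank$\{E[z_t x_{1,t}']\} = 3$ where $x_{1,t} = (-R_{2t}, -I_t, -S_t)'$,
and $\phi_0$ is identified by (3.87) provided rank$\{E[z_t x_{2,t}']\} = 3$ where $x_{2,t} =
(-R_{1t}, -R_{2t}, -I_t)'$. The form of the estimators is given by (2.8), that is


$$\hat\psi_T = \left[ \left( T^{-1} \sum_{t=1}^{T} x_{1,t} z_t' \right) W_T \left( T^{-1} \sum_{t=1}^{T} z_t x_{1,t}' \right) \right]^{-1} \left( T^{-1} \sum_{t=1}^{T} x_{1,t} z_t' \right) W_T \left( T^{-1} \sum_{t=1}^{T} z_t R_{1t} \right) \tag{3.88}$$

$$\hat\phi_T = \left[ \left( T^{-1} \sum_{t=1}^{T} x_{2,t} z_t' \right) W_T \left( T^{-1} \sum_{t=1}^{T} z_t x_{2,t}' \right) \right]^{-1} \left( T^{-1} \sum_{t=1}^{T} x_{2,t} z_t' \right) W_T \left( T^{-1} \sum_{t=1}^{T} z_t S_t \right) \tag{3.89}$$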


It is remarked above that the two normalizations of $\theta_0$ are logically consistent.
Since $\hat\psi_T \stackrel{p}{\to} \psi_0$ and $\hat\phi_T \stackrel{p}{\to} \phi_0$, the estimators must exhibit a similar logical
consistency in the limit. However, there is no reason for $\hat\psi_T$ and $\hat\phi_T$ to exhibit
this property in finite samples. For example, even though the model implies
$\psi_{0,1}/\psi_{0,2} = \phi_{0,2}/\phi_{0,3}$, the corresponding estimators in (3.88)–(3.89) do not
exhibit this property, that is $\hat\psi_{T,1}/\hat\psi_{T,2} \neq \hat\phi_{T,2}/\hat\phi_{T,3}$ in general. Fuhrer, Moore,
and Schuh (1995) provide empirical evidence that the estimators of inventory
models can be very sensitive to the choice of normalization. Further evidence
is provided by the simulation study reported in West and Wilcox (1994). ◇
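
The finite sample sensitivity to normalization can be seen in a stripped-down setting. The sketch below is not the inventory model itself: it assumes a hypothetical two-variable moment condition $E[z_t(\theta_1 a_t + \theta_2 b_t)] = 0$, simulated data, and the weighting matrix $W_T = (T^{-1}\sum z_t z_t')^{-1}$. Dividing by $\theta_1$ gives the normalization $a_t + \psi b_t$, dividing by $\theta_2$ gives $\phi a_t + b_t$, and the two are logically consistent because $\psi_0\phi_0 = 1$. All names in the code are illustrative assumptions.

```python
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def proj_inner(u, v, instruments):
    """Return u' P_Z v, where P_Z projects onto one or two instruments."""
    if len(instruments) == 1:
        (z,) = instruments
        return dot(z, u) * dot(z, v) / dot(z, z)
    z1, z2 = instruments
    a11, a12, a22 = dot(z1, z1), dot(z1, z2), dot(z2, z2)
    det = a11 * a22 - a12 * a12
    b1, b2 = dot(z1, v), dot(z2, v)
    c1 = (a22 * b1 - a12 * b2) / det   # first element of (Z'Z)^{-1} Z'v
    c2 = (a11 * b2 - a12 * b1) / det   # second element of (Z'Z)^{-1} Z'v
    return dot(z1, u) * c1 + dot(z2, u) * c2

def estimate_both(a, b, instruments):
    """Linear GMM with W_T = (Z'Z/T)^{-1} under the two normalizations."""
    psi_hat = -proj_inner(b, a, instruments) / proj_inner(b, b, instruments)
    phi_hat = -proj_inner(a, b, instruments) / proj_inner(a, a, instruments)
    return psi_hat, phi_hat

# Simulated data (an assumption for illustration): psi_0 = 2, so phi_0 = 0.5.
random.seed(12345)
T, psi0 = 200, 2.0
z1 = [random.gauss(0, 1) for _ in range(T)]
z2 = [random.gauss(0, 1) for _ in range(T)]
v = [random.gauss(0, 1) for _ in range(T)]
u = [0.8 * vi + random.gauss(0, 1) for vi in v]     # error correlated with b
b = [z1[t] + 0.5 * z2[t] + v[t] for t in range(T)]
a = [-psi0 * b[t] + u[t] for t in range(T)]

psi_over, phi_over = estimate_both(a, b, [z1, z2])  # overidentified: q = 2, p = 1
psi_just, phi_just = estimate_both(a, b, [z1])      # just identified: q = p = 1
print("overidentified:  psi_hat * phi_hat =", psi_over * phi_over)
print("just identified: psi_hat * phi_hat =", psi_just * phi_just)
```

In the just identified case the product of the two estimates equals one up to rounding error, so the estimators inherit the logical consistency of the parameters. In the overidentified case the product is bounded above by one (by the Cauchy-Schwarz inequality) and attains the bound only when the projected variables are exactly proportional, which does not happen in general for finite $T$.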


The GMM Estimator and Curvature Altering Transformations of the 
Population Moment Condition 
The GMM estimator is invariant to curvature altering transformations of the 
population moment condition if the parameter vector is just identified; however, 
if the parameter vector is overidentified then it only exhibits this property in 
the limit. 

We begin with the just identified case, that is $p = q$. Suppose that GMM
estimation is to be based upon the transformed population moment condition,


$$c(\theta_0) E[f(v_t, \theta_0)] = 0 \tag{3.90}$$


where $c(\theta_0)$ is a finite non-zero scalar.65 Since $p = q$, the GMM estimator is just
the Method of Moments estimator $\hat\theta_T$ obtained by solving the sample analog to

65 For simplicity, we take $c(.)$ to be a scalar, but the same arguments go through if $c(\theta_0)$ is
a $(p \times p)$ nonsingular matrix.

(3.90),

$$c(\hat\theta_T)\, T^{-1} \sum_{t=1}^{T} f(v_t, \hat\theta_T) = 0 \tag{3.91}$$

However, provided $c(\hat\theta_T)$ is finite and non-zero, (3.91) implies

$$T^{-1} \sum_{t=1}^{T} f(v_t, \hat\theta_T) = 0 \tag{3.92}$$

and so $\hat\theta_T$ is also the Method of Moments, and hence GMM, estimator of $\theta_0$
based on $E[f(v_t, \theta_0)] = 0$.66

However, if $q > p$ then the above argument does not go through because the
first order conditions do not set the sample moment to zero. Specifically, the


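GMM estimator based on (3.90) is now the solution to

$$\left\{ T^{-1} \sum_{t=1}^{T} \frac{\partial f(v_t, \hat\theta_T)}{\partial\theta'} + c(\hat\theta_T)^{-1} \left[ T^{-1} \sum_{t=1}^{T} f(v_t, \hat\theta_T) \right] \frac{\partial c(\hat\theta_T)}{\partial\theta'} \right\}' W_T\, T^{-1} \sum_{t=1}^{T} f(v_t, \hat\theta_T) = 0 \tag{3.93}$$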

In general, the solution to (3.93) does not satisfy the first order conditions
associated with GMM estimation based on the untransformed population moment
condition given in (3.12).67 However, since (3.90) holds, the estimator is
consistent for $\theta_0$ and so the transformation does not affect the probability limit of
the estimator.

As mentioned above, this type of transformation is employed when the min- 
imand is ill-behaved making estimation difficult. Such a problem occurs in 
Eichenbaum’s (1989) inventory model described in Section 1.3.3 and we
illustrate this type of transformation in Section 9.3 as part of our empirical
investigation of this model. ◇


Stationarity Inducing Transformations 

If it is possible to find one stationarity inducing transformation of f(.) then 
there are infinitely many such transformations. In general, the GMM estimator 
is sensitive to the choice of transformation in finite samples, but is consistent 
no matter which transformation is used. 

These statements are most easily substantiated in the context of a specific 
example. To this end, we consider the consumption based asset pricing model 
described in Section 1.3.1 and, to simplify the discussion, focus on the
specification used in our empirical implementation.68


66 Note that if the entire population moment condition in (3.71) is scaled by $c$, instead of
just scaling $v_t$, then the resulting GMM estimator is invariant to the choice of $c$.

67 The reader should be alerted to an abuse of notation in making the comparison between
these two equations. In Section 3.2, $\hat\theta_T$ is defined to be the solution to (3.12). In the current
paragraph, $\hat\theta_T$ has been used to denote the solution to (3.93).

68 That is with only one asset with a maturity of one period, and the constant relative risk
aversion utility function given in (1.21).

To begin, it is useful to revisit the derivation of the population moment
condition in Section 1.3.1 because the steps taken involve the implicit use of a 
stationarity inducing transformation. It can be recalled from this earlier dis- 
cussion that the derivation of the population moment condition began with 
a characterization of the optimal path for consumption in (1.19). Under the
conditions given above, this equation reduces to


$$p_t c_t^{\gamma_0 - 1} = \delta_0 E[r_{t+1} c_{t+1}^{\gamma_0 - 1} \mid \Omega_t] \tag{3.94}$$


From this starting point, we proceeded as follows. Since $p_t c_t^{\gamma_0 - 1} \in \Omega_t$, both
sides of this equation were divided by $p_t c_t^{\gamma_0 - 1}$ to give


$$E[\delta_0 (r_{t+1}/p_t)(c_{t+1}/c_t)^{\gamma_0 - 1} - 1 \mid \Omega_t] = 0 \tag{3.95}$$


and then the population moment condition is deduced from (3.95) using an
iterated expectations argument. However, we could have taken another approach.
Equation (3.94) can be rewritten as


$$E[\delta_0 r_{t+1} c_{t+1}^{\gamma_0 - 1} - p_t c_t^{\gamma_0 - 1} \mid \Omega_t] = 0 \tag{3.96}$$


It is then possible to use the same iterated expectations argument to deduce
population moment conditions based on (3.96). Why use the first approach and
not the second? The answer is simple. Population moment conditions based
on (3.95) involve $x_{1,t+1} = c_{t+1}/c_t$ and $x_{2,t+1} = r_{t+1}/p_t$, both of which are
stationary random variables, whereas moment conditions deduced from (3.96)
involve functions of the nonstationary variables $(c_t, r_t, p_t)$.

While the choice between (3.95) and (3.96) may be clear cut, it should be
noted that the stationarity inducing transformation used in (3.95) is not unique.
For example, let $w_t \in \Omega_t$ be a stationary random variable, then division of (3.94)
by $w_t p_t c_t^{\gamma_0 - 1}$ yields


$$E[\delta_0 w_t^{-1} (r_{t+1}/p_t)(c_{t+1}/c_t)^{\gamma_0 - 1} - w_t^{-1} \mid \Omega_t] = 0 \tag{3.97}$$


which can also form the basis for population moment conditions involving
stationary random variables. It follows, therefore, that there are an infinite
number of stationarity inducing transformations. The impact of the choice of $w_t$
is most easily understood by considering the moment condition upon which
estimation is ultimately based. It can be recalled from Section 1.3.1 that
an iterated expectations argument is used to deduce $E[u_t(\theta_0) z_t] = 0$ where
$u_t(\theta_0) = \delta_0 x_{1,t+1}^{\gamma_0 - 1} x_{2,t+1} - 1$. If the same argument is used starting from (3.97)
then the resulting moment condition is simply $E[u_t(\theta_0) \tilde{z}_t] = 0$ where $\tilde{z}_t = w_t^{-1} z_t$.
Therefore, the components in the stationarity inducing transformation play
different roles: division by $p_t c_t^{\gamma_0 - 1}$ actually induces stationarity and $w_t$ simply
scales the instrument vector. From this perspective, it is immediately apparent
that the resulting estimator is not invariant in finite samples to the choice of
stationarity inducing transformation, but is nevertheless consistent for $\theta_0$ for
any suitable choice of $w_t$.69 ◇


It is clear that either the estimator or subsequent inferences can be sensitive 
to the types of transformation considered above. The sensitivity of the estima- 
tor is particularly unappealing in the last three cases because there is clearly an 
arbitrariness to the specific normalization or transformation chosen. It is possi- 
ble to modify the GMM minimand to produce an estimator which is invariant 
to curvature altering transformations. This version is known as the continuous 
updating GMM estimator and is considered below. The sensitivity of inferences 
to reparameterization may be viewed as a less serious problem because of the 
“natural parameterization” argument. However, the latter view is not univer- 
sally accepted and so we explore an alternative method for the construction of 
confidence intervals. We now describe both these remedies in turn. 

The continuous updating GMM estimator was introduced by Hansen, Heaton,
and Yaron (1996). The motivation for this estimator is best understood by
considering the population analog to the GMM minimand with the optimal
weighting matrix. It can be recalled from Theorem 3.4 that the optimal choice of $W$ is
$S^{-1}$. For our purposes here, it is important to note that this choice of weighting
matrix depends on $\theta_0$ because $S = \lim_{T \to \infty} Var[T^{1/2} g_T(\theta_0)]$, and to emphasize
this dependence we now write $S = S(\theta_0)$. Using this notation, the population
analog to the GMM minimand is


$$Q_{pop}(\theta) = E[f(v_t, \theta)]' S(\theta)^{-1} E[f(v_t, \theta)] \tag{3.98}$$


Notice that both the population moment condition and weighting matrix
depend on $\theta$. However, since the consistency of the estimator depends crucially
on $E[f(v_t, \theta_0)] = 0$ and not on $S(\theta_0)^{-1}$, we have treated the dependencies of
$f(.)$ and $S(.)$ on $\theta$ differently so far. In the iterated estimation, a preliminary
estimator of $\theta_0$ is used to construct the weighting matrix and hence to eliminate
the argument from the weighting matrix so that the minimand takes the form


$$Q_{iter,T}(\theta) = g_T(\theta)' S_T(i-1)^{-1} g_T(\theta) \tag{3.99}$$


While this approach is perfectly reasonable, it is not the only one possible.
An alternative is to acknowledge the dependence of $S$ on $\theta$ in the minimization
and hence define the minimand to be


$$Q_{cont,T}(\theta) = g_T(\theta)' S_T(\theta)^{-1} g_T(\theta) \tag{3.100}$$
where
$$S_T(\theta) = \Gamma_{0,T}(\theta) + \sum_{i=1}^{T-1} \omega_{i,T} \left( \Gamma_{i,T}(\theta) + \Gamma_{i,T}(\theta)' \right) \tag{3.101}$$

69 If the Euler equation is linear in the variables then it is possible to argue that an analogous
conditional moment restriction is satisfied by the detrended variables. See Section 9.3 for
further discussion of this approach to inducing stationarity.

where $\Gamma_{j,T}(\theta) = T^{-1} \sum_{t=j+1}^{T} f(v_t, \theta) f(v_{t-j}, \theta)'$. Notice that $S_T(\theta)$ has the
generic form of the HAC estimators discussed in Section 3.5.3 and so $S_T(\theta_0) \stackrel{p}{\to} S$
under appropriate conditions upon the dynamic structure of $f(v_t, \theta_0)$ and kernel,
$\omega_{i,T}$. The continuous updating GMM estimator is defined to be,


$$\hat\theta_{cont,T} = \mathrm{argmin}_{\theta \in \Theta}\, Q_{cont,T}(\theta) \tag{3.102}$$


Intuition suggests that the continuous updating estimator exhibits the same 
asymptotic properties as the two step or iterated estimator, and this is the case. 
This can be established using similar arguments to the proofs of Theorems 3.1 
and 3.2 and is left to the reader. Although the iterated and continuous updating 
estimators have the same asymptotic distributions, they are typically different 
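in finite samples. The first order conditions for the iterated estimation are given
by (3.12) with $W_T = S_T(i-1)^{-1}$; those for the continuous updating estimator
are given by,70


$$2 G_T(\hat\theta_T)' S_T(\hat\theta_T)^{-1} g_T(\hat\theta_T) - \left\{ \frac{\partial \mathrm{vec}[S_T(\hat\theta_T)]}{\partial\theta'} \right\}' \left[ S_T(\hat\theta_T)^{-1} \otimes S_T(\hat\theta_T)^{-1} \right] \mathrm{vec}[g_T(\hat\theta_T) g_T(\hat\theta_T)'] = 0 \tag{3.103}$$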


A comparison of the two sets of equations indicates that the first order con- 
ditions for the continuous updating estimator contain an additional term due 
to the presence of the argument in the weighting matrix. To make this second
term explicit, it is necessary to substitute in the appropriate formula for
$\partial \mathrm{vec}[S_T(\theta)]/\partial\theta'$. For our purposes here, it is sufficient to restrict attention to the case
in which $\omega_{i,T} = 0$, that is in which the long run variance is estimated under the
assumption that $f(v_t, \theta_0)$ is a serially uncorrelated process. In this case, it can
be shown that


$$\frac{\partial \mathrm{vec}[S_T(\theta)]}{\partial\theta'} = T^{-1} \sum_{t=1}^{T} \left\{ [I_q \otimes f(v_t, \theta)] + [f(v_t, \theta) \otimes I_q] \right\} \frac{\partial f(v_t, \theta)}{\partial\theta'} \tag{3.104}$$


In general, there is no reason why the solutions to (3.12) and (3.103) should
coincide for finite $T$. However, it can be verified that both sets of equations are
satisfied by $\theta_0$ in the limit.
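
The origin of the additional term in (3.103) can be traced with the chain rule. The following is only a sketch of the argument, suppressing arguments and writing $g$, $G$ and $S$ for $g_T(\theta)$, $G_T(\theta) = \partial g_T(\theta)/\partial\theta'$ and $S_T(\theta)$; it uses the symmetry of $S$ and standard results on the differentiation of a matrix inverse.

```latex
\begin{aligned}
Q_{cont,T}(\theta) &= g' S^{-1} g \;=\; \mathrm{vec}(g g')'\, \mathrm{vec}(S^{-1}), \\[4pt]
\frac{\partial\, \mathrm{vec}(S^{-1})}{\partial \theta'}
  &= -\,(S^{-1} \otimes S^{-1})\, \frac{\partial\, \mathrm{vec}(S)}{\partial \theta'}, \\[4pt]
\frac{\partial Q_{cont,T}(\theta)}{\partial \theta}
  &= \underbrace{2\, G' S^{-1} g}_{\text{holding } S \text{ fixed}}
   \;-\; \underbrace{\left\{ \frac{\partial\, \mathrm{vec}(S)}{\partial \theta'} \right\}'
     (S^{-1} \otimes S^{-1})\, \mathrm{vec}(g g')}_{\text{holding } g \text{ fixed}} .
\end{aligned}
```

The first term is the derivative that appears when the weighting matrix does not depend on $\theta$, and the second term arises solely because $S_T(\theta)$ is re-evaluated at each $\theta$.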

The chief advantage of the continuous updating estimator is that it is
invariant to curvature altering transformations of $f(v_t, \theta)$. To illustrate, consider
again the situation described above in which the population moment condition
is multiplied by $c(\theta_0)$, and so estimation is based on (3.90). The key difference
now is that $Q_{cont,T}(\theta)$ depends on $\theta$ via both the sample moment and the
inverse of the covariance matrix. After the transformation the sample moment is
$c(\theta) g_T(\theta)$ and the inverse of the covariance matrix is $c(\theta)^{-2} S_T(\theta)^{-1}$. Once these
terms are substituted into the minimand in (3.100), it is easily verified that the


70 These equations can be derived using Dhrymes (1984) [Proposition 99, p.115; Proposition
106, p.124].

factors involving $c(\theta)$ cancel out, and so the estimator is unaffected by this type
of transformation. In some cases, different elements of the population moment
may be transformed by different functions of $\theta$, and the previous argument is
easily extended to cover this case and also the more general scenario in which
$f(v_t, \theta)$ is premultiplied by any nonsingular matrix $C(\theta)$.
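
The cancellation of the factors involving $c(\theta)$ can be checked numerically. The sketch below is an illustration under stated assumptions, not the empirical model of this chapter: it uses a hypothetical linear moment function $f(v_t, \theta) = z_t(y_t - \theta x_t)$ with two instruments and a scalar $\theta$, simulated data, the case $\omega_{i,T} = 0$ so that $S_T(\theta) = T^{-1}\sum f(v_t,\theta) f(v_t,\theta)'$, and the arbitrary choice $c(\theta) = 1 + \theta^2$. All function names are ours.

```python
import random

def moment_rows(theta, data, c=lambda th: 1.0):
    """Rows c(theta) * z_t * (y_t - theta * x_t) for the two instruments."""
    scale = c(theta)
    return [(scale * z1 * (y - theta * x), scale * z2 * (y - theta * x))
            for (y, x, z1, z2) in data]

def mean_and_cov(rows):
    """Sample moment g_T and S_T = T^{-1} sum f f' (the omega_{i,T} = 0 case)."""
    T = len(rows)
    g = (sum(r[0] for r in rows) / T, sum(r[1] for r in rows) / T)
    s11 = sum(r[0] * r[0] for r in rows) / T
    s12 = sum(r[0] * r[1] for r in rows) / T
    s22 = sum(r[1] * r[1] for r in rows) / T
    return g, (s11, s12, s22)

def quad(g, w):
    """g' W g for a symmetric 2x2 matrix W stored as (w11, w12, w22)."""
    return w[0] * g[0] ** 2 + 2.0 * w[1] * g[0] * g[1] + w[2] * g[1] ** 2

def q_cont(theta, data, c=lambda th: 1.0):
    """Continuous updating minimand: the weighting matrix is S_T(theta)^{-1}."""
    g, (s11, s12, s22) = mean_and_cov(moment_rows(theta, data, c))
    det = s11 * s22 - s12 * s12
    return quad(g, (s22 / det, -s12 / det, s11 / det))

def q_fixed(theta, data, w, c=lambda th: 1.0):
    """Minimand with a weighting matrix that does not depend on theta."""
    g, _ = mean_and_cov(moment_rows(theta, data, c))
    return quad(g, w)

# Simulated data (an assumption for illustration), true theta equal to one.
random.seed(7)
data = []
for _ in range(300):
    z1, z2, v = random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1)
    x = z1 + z2 + v
    y = 1.0 * x + 0.5 * v + random.gauss(0, 1)
    data.append((y, x, z1, z2))

c = lambda th: 1.0 + th ** 2            # hypothetical curvature altering factor
grid = [0.5 + 0.01 * k for k in range(101)]
identity = (1.0, 0.0, 1.0)

gap_cont = max(abs(q_cont(th, data, c) - q_cont(th, data)) for th in grid)
gap_fixed = max(abs(q_fixed(th, data, identity, c) - q_fixed(th, data, identity))
                for th in grid)
print("max change in continuous updating minimand:", gap_cont)
print("max change in fixed-weight minimand:       ", gap_fixed)
```

The continuous updating minimand is unchanged by the transformation at every point on the grid, up to rounding error, and so its minimizer is unchanged. The fixed-weight minimand is multiplied by $c(\theta)^2$, which alters its shape and, when $q > p$, generally moves its minimizer.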


It is important to realize that the invariance of the continuous updating esti- 
mator is only with respect to curvature altering transformations of the popula- 
tion moment condition. However, there are cases in which the net effect of one 
of the other types of transformation is to premultiply the population moment 
by some nonsingular matrix $C(\theta)$, and so the continuous updating estimator is
invariant to these types of transformation in such cases as well. 


Example: Hansen and Singleton’s (1982) Consumption Based Asset 
Pricing Model 

Table 3.9 contains the continuous updating estimates, their standard errors and 
95% confidence intervals for the parameters for both choices of assets. The 
starting values for the estimations are the iterated estimates reported in Table 
3.7. This choice is made for two reasons. First, if the model is correctly
specified, then the iterated estimator is consistent for $\theta_0$ which is a solution to the
first order conditions in (3.103) in the limit. Secondly, in this example, this
choice of starting value initiates the minimization in an area within which the 
minimand is reasonably well behaved. In contrast to our experience with the 
iterated estimator, the results are very sensitive to the choice of starting value. 
In particular, for certain starting values, the numerical optimization routine
diverges into parts of the parameter space clearly not in the neighbourhood of the
global minimum of $T Q_{cont,T}(\theta)$. A similar experience is reported by Hansen,
Heaton, and Yaron (1996) in the context of slightly more sophisticated versions 
of the consumption based asset pricing model. These differing experiences can 
be explained by considering the surface of the minimands with the VWR data. 
Figure 3.3 plots the second step minimand based on the first step estimates 
calculated using Wr = 10°J; with the VWR data. A comparison with Figure 
3.2 indicates the minimand has the same valley like shape on both first and 
second steps. In contrast, the surface of the continuous updating minimand has 
a ravine in which the minimum is located as shown in Figure 3.4. The minimum 
is far harder to locate in the latter case particularly as the surface is relatively 
flat around the ravine. 


With VWR, the results are very similar, although not identical, to those
reported for the iterated estimator. With EWR, the only noticeable difference
between the estimation results is in the estimate of $\gamma_0$: continuous updating
GMM yields 0.515 for this parameter, as opposed to −0.343 with iterated GMM.
Notwithstanding this difference, the results are qualitatively the same from
the iterated and continuous GMM estimations. In both cases, $\delta_0$ is precisely
estimated but $\gamma_0$ is not. Furthermore, the source of this imprecision is the same
here as it is in the iterated estimation: $\gamma_0$ is weakly identified by the population
moment condition associated with continuous updating GMM. ◇

Figure 3.3: Second-step GMM minimand for the consumption based asset
pricing model with value weighted returns

Figure 3.4: Continuous Updating GMM minimand for the consumption based
asset pricing model with value weighted returns


Table 3.9
Continuous Updating GMM estimation results for the
consumption based asset pricing model

Asset   $(\hat\gamma_T, \hat\delta_T)$   s.e.$(\hat\gamma_T)$   c.i.$(\hat\gamma_T)$   s.e.$(\hat\delta_T)$   c.i.$(\hat\delta_T)$
EWR     (0.515, 0.990)    2.229    (−3.853, 4.884)    0.004    (0.981, 0.998)
VWR     (0.785, 0.993)    1.829    (−2.801, 4.370)    0.004    (0.986, 1.000)


Notes: s.e.(.) denotes the standard error calculated using (3.59) with $W_T = \hat{S}_{SU}^{-1}$, $\hat{S}_T = \hat{S}_{SU}$
and $\hat{S}_{SU}$ is defined in (3.40). c.i.(.) denotes the 95% confidence interval calculated using
(3.27).


It is the form of the minimand that gives the continuous updating estima- 
tor its invariance to normalization of the population moment condition. It is 
this minimand which also provides the key to the construction of asymptotic 
confidence sets which are invariant to reparameterization. We use the term
“confidence set” because the approach described below is based on a
probability statement involving $\theta_0$ rather than statements involving its individual
elements. This approach was first introduced into the GMM literature by Stock 
and Wright (1995, 2000) although in the context of a different problem. Stock 
and Wright are concerned with the problem of inference in the presence of 
weakly identified parameters, and we discuss this approach to inference in that 
context in Section 8.2. For the present, we focus purely on the construction of 
confidence sets which are invariant to reparameterization.71

To derive these confidence sets, it is necessary to consider the limiting
distribution of $T Q_{cont,T}(\theta_0)$. This distribution follows straightforwardly from the
limiting behaviour of its components, $T^{1/2} g_T(\theta_0)$ and $S_T(\theta_0)^{-1}$. Under the
conditions of Lemma 3.2, it follows that $T^{1/2} g_T(\theta_0) \stackrel{d}{\to} N(0, S)$. If it is also


71 In the weak identification literature, these confidence sets are sometimes referred to as S-sets,
a terminology inspired by the notation used by Stock and Wright (2000).

assumed that $S_T(\theta_0) \stackrel{p}{\to} S$, then $S_T(\theta_0)^{-1} \stackrel{p}{\to} S^{-1}$.72 Combining these two
results, it follows that


$$T Q_{cont,T}(\theta_0) \stackrel{d}{\to} \chi^2_q \tag{3.105}$$


An asymptotically valid $100(1 - \alpha)\%$ confidence set for $\theta_0$ is then given by

$$\{\theta : T Q_{cont,T}(\theta) < c_q(\alpha)\} \tag{3.106}$$


where $c_q(\alpha)$ is the $100(1 - \alpha)$th percentile of the $\chi^2_q$ distribution. In other words,
the confidence sets in (3.106) consist of all values of $\theta$ for which the minimand
of the continuous updating GMM estimator does not exceed the appropriate
percentile of the limiting distribution of $T Q_{cont,T}(\theta_0)$. It is easily recognized
that our earlier arguments about the invariance of the estimator to reparame- 
terization can be applied here to show that the confidence sets in (3.106) ex- 
hibit the same invariance property. This confidence set is illustrated below for 
our running example. In that particular case, the calculations are relatively 
straightforward because $\theta_0$ is only a $(2 \times 1)$ vector. However, the computational
burden increases rapidly with p and quickly becomes prohibitive. Therefore, this 
method of calculating confidence sets can be infeasible in many cases of interest. 
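
The construction in (3.106) amounts to a search over the parameter space, which is why the computational burden grows quickly with $p$. The sketch below illustrates the calculation on a grid under simplifying assumptions that are ours rather than those of the running example: a scalar $\theta$, a hypothetical linear moment function $f(v_t, \theta) = z_t(y_t - \theta x_t)$ with $q = 2$ instruments, simulated data, and $\omega_{i,T} = 0$ in $S_T(\theta)$. With $q = 2$ the $\chi^2_q$ distribution function is $1 - \exp(-c/2)$, so the critical value is available in closed form as $c_2(\alpha) = -2\ln(\alpha)$.

```python
import math
import random

def t_q_cont(theta, data):
    """T * Q_cont,T(theta) for f = z_t * (y_t - theta * x_t), two instruments."""
    T = len(data)
    rows = [(z1 * (y - theta * x), z2 * (y - theta * x)) for (y, x, z1, z2) in data]
    g1 = sum(r[0] for r in rows) / T
    g2 = sum(r[1] for r in rows) / T
    s11 = sum(r[0] ** 2 for r in rows) / T
    s12 = sum(r[0] * r[1] for r in rows) / T
    s22 = sum(r[1] ** 2 for r in rows) / T
    det = s11 * s22 - s12 ** 2
    # g' S^{-1} g written out for the 2x2 case
    return T * (s22 * g1 * g1 - 2.0 * s12 * g1 * g2 + s11 * g2 * g2) / det

# Simulated data (an assumption for illustration), true theta equal to one.
random.seed(11)
data = []
for _ in range(300):
    z1, z2, v = random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1)
    x = z1 + z2 + v
    y = 1.0 * x + 0.5 * v + random.gauss(0, 1)
    data.append((y, x, z1, z2))

alpha = 0.05
crit = -2.0 * math.log(alpha)            # 95th percentile of chi-square(2)
grid = [0.005 * k for k in range(401)]   # candidate values of theta in [0, 2]
stats = [t_q_cont(th, data) for th in grid]
conf_set = [th for th, s in zip(grid, stats) if s < crit]

if conf_set:
    print("grid points in the confidence set range from",
          min(conf_set), "to", max(conf_set))
else:
    print("the confidence set is empty on this grid")
```

The set is collected point by point rather than as an interval because nothing guarantees that it is connected, or indeed non-empty. With a $(p \times 1)$ parameter vector the same search requires a grid in $p$ dimensions.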


Example: Hansen and Singleton’s (1982) Consumption Based Asset 
Pricing Model 

It can be recalled that the model has been estimated for two types of asset, 
VWR and EWR. As it turns out, these two cases provide a good illustration 
of a fundamental difference between the confidence sets and marginal intervals 
reported earlier. By construction the marginal intervals in (3.27) are non-empty. 
However, it is entirely possible for the confidence set in (3.106) to contain no 
elements, and this is exactly what happens when the model is estimated with 
EWR. Such a phenomenon provides evidence that the model is misspecified. 
We come to the same conclusion using model specification tests in Section 5.1, 
and delay further discussion of this outcome until then. Instead we focus here 
on the case in which the model is estimated with VWR. For this case, the
95% confidence set for $(\delta_0, \gamma_0)$ consists of all points within the ellipse plotted in
Figure 3.5. This confidence set is clearly more informative than the marginal
intervals reported in Tables 3.8 and 3.9 because it reveals a connection between
the plausible values for $\gamma_0$ and $\delta_0$. In general terms, higher values of $\gamma_0$ in
the set are associated with smaller values of $\delta_0$ and vice versa. However, in one
sense the confidence set and marginal confidence intervals are similar: they both
imply $\delta_0$ is estimated very precisely but $\gamma_0$ is not. ◇


72 The reader is referred to the references given in Section 3.5.3 for appropriate regularity
conditions for this result to hold.

Figure 3.5: 95% confidence set for $\theta_0$ in the consumption based asset
pricing model with value weighted returns


3.8 GMM as a Unifying Principle of 
Estimation 


It is stated in Chapter 1 that GMM provides a unifying framework for the
analysis of many econometric estimators. At that point it was only possible 
to provide a few illustrations of this thesis but we are now in a position to 
elaborate further. So we conclude this chapter by describing how the GMM 
framework encompasses many other estimators derived using a seemingly dif- 
ferent approach. This section covers material which is irrelevant to many of the 
applications listed in Table 1.1, and so some readers may wish to proceed to 
Chapter 4. 


It is convenient to divide the discussion into two parts. First, we consider the
case in which all the elements of $\theta_0$ are estimated simultaneously. For reasons
that will become apparent, we refer to such estimators as single step. This is
the case upon which we have focused in the book so far. Then we consider the
case of sequential estimators in which the elements of the parameter vector are
estimated in stages.

3.8.1 Single Step Estimators


Many econometric estimators are obtained by optimizing a scalar of the form


$$\sum_{t=1}^{T} N_t(\theta) \tag{3.107}$$


Two leading examples are least squares and maximum likelihood, both of which
we discuss in more detail below. If $N_t(\theta)$ is differentiable then the estimator, $\hat\theta$,
is the value which solves the associated first order conditions


$$\sum_{t=1}^{T} \partial N_t(\hat\theta)/\partial\theta = 0 \tag{3.108}$$


Equation (3.108) implies that $\hat\theta$ is equivalent to the Method of Moments
estimator based on the population moment condition


$$E[\partial N_t(\theta_0)/\partial\theta] = 0 \tag{3.109}$$


Since $\partial N_t(\theta)/\partial\theta$ is a $(p \times 1)$ vector it can be recalled from Section 3.3 that $\hat\theta$ is
also the GMM estimator based on (3.109).

As illustrations, we now derive the population moment condition implicit in 
the GMM interpretation of least squares and maximum likelihood estimation. 
A further example can be found in the next sub-section. 


Example: Ordinary Least Squares Estimation in the Linear Regres- 
sion Model 

Suppose the static, linear regression model from Chapter 2 is estimated by
ordinary least squares. Typically, this estimator is derived as the value of $\theta$ which
minimizes the residual sum of squares. Within the terms of our discussion here,
this involves

$$N_t(\theta) = (y_t - x_t'\theta)^2$$

Therefore, the OLS estimator can be interpreted as a GMM estimator based on
the population moment condition


$$E[x_t(y_t - x_t'\theta_0)] = 0 \tag{3.110}$$


This condition states that the regressors and error are uncorrelated and is, of
course, one of the assumptions of the “Classical regression model”. ◇
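
The equivalence between the two derivations is easily checked numerically. The sketch below assumes, purely for illustration, a single regressor and simulated data; it computes the Method of Moments estimator by solving the sample analog of (3.110) and confirms that the same value minimizes the residual sum of squares. The variable and function names are ours.

```python
import random

# Simulated data (an assumption for illustration): y_t = 1.5 x_t + u_t.
random.seed(0)
T = 100
x = [random.gauss(0, 1) for _ in range(T)]
y = [1.5 * xi + random.gauss(0, 1) for xi in x]

def sample_moment(theta):
    """Sample analog of (3.110): T^{-1} sum x_t (y_t - x_t theta)."""
    return sum(xi * (yi - xi * theta) for xi, yi in zip(x, y)) / T

def rss(theta):
    """Residual sum of squares, i.e. the sum over t of N_t(theta)."""
    return sum((yi - xi * theta) ** 2 for xi, yi in zip(x, y))

# Method of Moments: the sample moment is linear in theta, so it can be
# solved in closed form.
theta_mm = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

print("estimate:", theta_mm)
print("sample moment at the estimate:", sample_moment(theta_mm))
print("RSS at estimate, and at estimate +/- 0.01:",
      rss(theta_mm), rss(theta_mm + 0.01), rss(theta_mm - 0.01))
```

The value that sets the sample moment to zero also delivers the smallest residual sum of squares among the values compared, as the first order conditions in (3.108) imply.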


Example: Maximum Likelihood Estimation 

Suppose the conditional probability density function of the continuous
stationary random vector $v_t$ given $\{v_{t-1}, v_{t-2}, \ldots\}$ is $p(v_t; \theta_0, V_{t-1})$ where $V_{t-1} =
(v_{t-1}', v_{t-2}', \ldots, v_{t-k}')'$. The maximum likelihood estimator (MLE) of $\theta_0$ based
on the conditional log likelihood function is the value of $\theta$ which maximizes,


$$L_T(\theta) = \sum_{t=1}^{T} \ln\{p(v_t; \theta, V_{t-1})\} \tag{3.111}$$

This fits within our framework with $N_t(\theta) = \ln\{p(v_t; \theta, V_{t-1})\}$ and so the MLE
can be interpreted as a GMM estimator based on the population moment
condition

$$E[\partial \ln\{p(v_t; \theta_0, V_{t-1})\}/\partial\theta] = 0 \tag{3.112}$$


Since both OLS and MLE are derived from perfectly valid estimation prin- 
ciples in their own right, it is reasonable to question whether there is any value 
to this GMM interpretation. In fact there are two main advantages. First, the 
GMM interpretation focuses attention specifically on the information used in 
estimation; whereas this is often not apparent from the original derivation of 
the estimators. For example, the importance of (3.110) for OLS estimation only 
emerges in proofs of unbiasedness or consistency of the estimator. Secondly, 
this interpretation allows the asymptotic properties of a variety of seemingly 
different estimators to be deduced using the framework discussed in the previ- 
ous sections. It is in this sense that we refer to GMM as a unifying principle 
of estimation. To illustrate both these advantages, we return to the case of 
Maximum Likelihood estimation. 


Example: Maximum Likelihood Estimation (Continued) 

It is argued in Chapter 1 that the dependence of MLE on the probability distri- 
bution was a major weakness in the types of nonlinear dynamic models in Table 
1.1. This problem is more readily appreciated using the GMM interpretation 
of MLE. The above analysis indicates that the MLE is consistent if (3.112) is 
satisfied. In fact, this population moment condition is automatically satisfied 
if the distribution is correctly specified. It is useful to prove this result here 
because it provides a natural starting point for considering the consequences of 
misspecification. 

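By definition, a probability density function satisfies


$$\int_{\mathcal{V}} p(v_t; \theta_0, V_{t-1})\, dv_t = 1 \tag{3.113}$$


where $\int_{\mathcal{V}} (.)\, dv_t$ denotes integration with respect to $v_t$ over the sample space $\mathcal{V}$.
Differentiation of (3.113) yields


$$\frac{\partial}{\partial\theta} \int_{\mathcal{V}} p(v_t; \theta_0, V_{t-1})\, dv_t = 0 \tag{3.114}$$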


If $p(.)$ satisfies the relatively mild conditions for the reversal of the orders of
differentiation and integration then (3.114) implies


$$\int_{\mathcal{V}} \{\partial p(v_t; \theta_0, V_{t-1})/\partial\theta\}\, dv_t = 0$$

This equation can be rewritten as


$$\int_{\mathcal{V}} \{\partial p(v_t; \theta_0, V_{t-1})/\partial\theta\}\, \frac{1}{p(v_t; \theta_0, V_{t-1})}\, p(v_t; \theta_0, V_{t-1})\, dv_t = 0 \tag{3.115}$$


If the probability density function is correctly specified then (3.115) is identical
to (3.112) because $\partial \ln\{p(\theta)\}/\partial\theta = \{1/p(\theta)\}\partial p(\theta)/\partial\theta$, for any scalar function
$p(.)$. However, notice that if $p(.)$ is no longer the true probability density
function of $v_t$ then (3.115) cannot be interpreted as an expectation and so does not
imply (3.112).

Does this mean that (3.112) never holds if the distribution is misspecified? 
The answer is no, but once the possibility of misspecification is admitted then 
its theoretical justification disappears. This issue is best understood using an 
example. Consider again the consumption based asset pricing model we have 
used throughout this chapter. As mentioned in Section 1.3.1 the conditional
distribution of $\tilde{x}_{t+1} = (\tilde{x}_{1,t+1}, \tilde{x}_{2,t+1})' = (\ln(x_{1,t+1}), \ln(x_{2,t+1}))'$ is unknown
but let us suppose it is assumed to be normal. To be consistent with the
economic model, the likelihood must be maximized subject to the restriction
$E[\delta_0 x_{1,t+1}^{\gamma_0 - 1} x_{2,t+1} \mid \Omega_t] = 1$. Hansen and Singleton (1982) show that for this model
one element of (3.112) is equivalent to the population moment condition


$$E[\ln(\delta_0) + (\gamma_0 - 1)\tilde{x}_{1,t+1} + \tilde{x}_{2,t+1} + 0.5\{(\gamma_0 - 1)^2 \sigma_{11} + \sigma_{22} + 2(\gamma_0 - 1)\sigma_{12}\}] = 0 \tag{3.116}$$

where $\sigma_{ij}$ is the $i$-$j$th element of the conditional variance of $\tilde{x}_{t+1}$. Equation
(3.116) holds if the conditional distribution of $\tilde{x}_{t+1}$ is normal. However, if the
distribution has been misspecified then this condition can no longer be justified
by the line of argument in (3.113)–(3.115). Furthermore, a comparison with
(1.22) indicates that (3.116) is not implied by the Euler equation of the economic
model. Therefore, if the distribution has been incorrectly specified then there 
is neither a statistical nor an economic justification for the moment condition 
upon which this Maximum Likelihood estimation is based. This motivates the 
use of GMM estimation based on population moment conditions implied by the 
economic model. 

The problems here stem from the presence of nonlinear functions of endogenous
variables in the population moment condition.73 If this feature is not
present, then (3.112) may hold for a wide class of plausible true probability
distributions. So there are circumstances in which Maximum Likelihood is
undertaken even though the distribution is unknown. In this case, it is referred to
as Quasi Maximum Likelihood estimation (White, 1982) or Pseudo Maximum
Likelihood estimation (Gourieroux, Monfort, and Trognon, 1984). Both these
sets of authors derive the asymptotic distribution of the estimator. However, 
it can also be derived directly using the GMM framework. If it is assumed 
that (3.112) holds then Theorem 3.2 implies the suitably normalized Quasi- 
MLE converges to a normal distribution with mean zero and covariance matrix 


73 See Amemiya (1977) and Phillips (1982) for further discussion of this issue.

$(G_0' S^{-1} G_0)^{-1}$ where74


$$G_0 = E[\partial^2 \ln\{p(v_t; \theta_0, V_{t-1})\}/\partial\theta\partial\theta']$$
$$S = E[\{\partial \ln[p(v_t; \theta_0, V_{t-1})]/\partial\theta\}\{\partial \ln[p(v_t; \theta_0, V_{t-1})]/\partial\theta\}']$$


If the distribution of $v_t$ is misspecified then no further reduction of the
asymptotic covariance matrix is possible. However, if the distribution is correctly
specified then the information matrix identity75 implies $(G_0' S^{-1} G_0)^{-1}$ equals
$S^{-1}$ and so the GMM framework yields the familiar result from Maximum
Likelihood theory. ◇
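
The final step can be written out explicitly. As a sketch, recall that under correct specification the information matrix identity equates the expected Hessian of the log density with minus the expected outer product of the score, which in the notation above reads $G_0 = -S$. Substituting this into the GMM covariance matrix gives:

```latex
\begin{aligned}
G_0 &= -S
  && \text{(information matrix identity)} \\[4pt]
(G_0' S^{-1} G_0)^{-1}
  &= \left\{ (-S)'\, S^{-1}\, (-S) \right\}^{-1}
   = \left\{ S\, S^{-1}\, S \right\}^{-1}
   = S^{-1}
  && \text{(using } S = S' \text{)}
\end{aligned}
```

If the distribution is misspecified then $G_0 \neq -S$ in general and the "sandwich" form $(G_0' S^{-1} G_0)^{-1}$ does not collapse in this way.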


3.8.2 Sequential Estimators 


So far we have concentrated on the case in which all elements of $\theta_0$ are
estimated simultaneously. However, in some cases it is convenient to estimate
$\theta_0$ sequentially. In this section, it is shown that a class of sequential
estimators are also special cases of GMM. We start with the general case and
then illustrate the ideas using a model with generated regressors.

To introduce the basic idea, it is sufficient to focus on two step sequential
estimation procedures. Accordingly, we partition the parameter vector into
$\theta_0 = (\theta_{0,1}', \theta_{0,2}')'$ where $\theta_{0,i}$ is a
$(p_i \times 1)$ vector. Suppose that in the first step, $\theta_{0,1}$ is
estimated by GMM based on the population moment condition
$E[f_1(v_t, \theta_{0,1})] = 0$ with weighting matrix $W_{1,T}$. Let this
estimator be $\hat\theta_{1,T}$. Now suppose that in the second step,
$\theta_{0,2}$ is estimated by GMM based on the $(p_2 \times 1)$ population
moment condition $E[f_2(v_t, \theta_0)] = 0$ with $\hat\theta_{1,T}$ substituted
for $\theta_{0,1}$. Notice that $\theta_{0,2}$ is just identified by
$E[f_2(v_t, \theta_0)] = 0$ conditional on $\theta_{0,1}$ and so the weighting
matrix plays no role in this estimation. Newey and McFadden (1994) show that
this sequential estimation procedure is identical to the single step estimation
of $\theta_0$ via GMM based on $E[f(v_t, \theta_0)] = 0$ where


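$$f(v_t, \theta_0) = \begin{bmatrix} f_1(v_t, \theta_{0,1}) \\ f_2(v_t, \theta_0) \end{bmatrix} \qquad (3.117)$$


and the weighting matrix


$$W_T = \begin{bmatrix} W_{1,T} & 0 \\ 0 & W_{2,T} \end{bmatrix} \qquad (3.118)$$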


for any positive definite matrix $W_{2,T}$. At first glance this may seem
surprising but there is a simple intuition behind the result. From (3.117) and
(3.118) it follows that the minimand for the simultaneous estimation can be
written as


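$$\begin{aligned}
Q_T(\theta) &= Q_{1,T}(\theta_1) + Q_{2,T}(\theta_1, \theta_2) \\
            &= g_{1,T}(\theta_1)' W_{1,T}\, g_{1,T}(\theta_1) + g_{2,T}(\theta_1, \theta_2)' W_{2,T}\, g_{2,T}(\theta_1, \theta_2)
\end{aligned}$$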


74 Notice that (3.115) implies $\partial \ln[p(v_t; \theta_0, V_{t-1})]/\partial\theta$
is a martingale difference sequence with respect to $V_{t-1}$.
75 For example, see White (1982).

where $g_{1,T}(\theta_1) = T^{-1}\sum_{t=1}^{T} f_1(v_t, \theta_1)$ and
$g_{2,T}(\theta_1, \theta_2) = T^{-1}\sum_{t=1}^{T} f_2(v_t, \theta)$.
Since $f_2(\cdot)$ is $(p_2 \times 1)$, there always exists a value of $\theta_2$
which sets $Q_{2,T}(\theta_1, \theta_2)$ to zero regardless of the value of
$\theta_1$. So the minimization of $Q_T(\theta)$ can be performed by first
finding the value $\hat\theta_{1,T}$ which minimizes $Q_{1,T}(\theta_1)$ and then
finding the value of $\theta_2$ which sets $Q_{2,T}(\hat\theta_{1,T}, \theta_2)$
to zero. Clearly, this is just the sequential procedure described above.
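The equivalence between the sequential and the single step estimator can be
checked numerically. The sketch below is a toy linear example constructed for
illustration, not one from the text: $\theta_1$ is a common mean
over-identified by two moments, and $\theta_2$ is just identified by a third
moment that also involves $\theta_1$. The helper `linear_gmm` and all data
generating choices are assumptions.

```python
import numpy as np

def linear_gmm(m, A, W):
    """Minimize (m - A @ theta)' W (m - A @ theta) in closed form."""
    return np.linalg.solve(A.T @ W @ A, A.T @ W @ m)

rng = np.random.default_rng(1)
T = 500
x = rng.normal(1.0, 1.0, T)
y = rng.normal(1.0, 2.0, T)
z = rng.normal(3.0, 1.0, T)

# Step 1: theta_1 from the moments E[x - theta_1] = 0 and E[y - theta_1] = 0.
W1 = np.diag([1.0, 0.25])
m1 = np.array([x.mean(), y.mean()])
A1 = np.array([[1.0], [1.0]])
theta1_seq = linear_gmm(m1, A1, W1)[0]

# Step 2: theta_2 from the single moment E[z - theta_1 - theta_2] = 0,
# with the first-step estimate substituted for theta_1.
theta2_seq = z.mean() - theta1_seq

# Single step: stacked moments with a block-diagonal weighting matrix.
m = np.array([x.mean(), y.mean(), z.mean()])
A = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
W2 = 7.3  # any positive value; it should not affect the estimates
W = np.block([[W1, np.zeros((2, 1))],
              [np.zeros((1, 2)), np.array([[W2]])]])
theta_joint = linear_gmm(m, A, W)
```

Changing `W2` to any other positive number leaves `theta_joint` unchanged in
this example, because the second block of the minimand can always be set to
zero.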

It is important to notice that this argument only works for situations in which
$f_2(\cdot)$ is the same dimension as $\theta_{0,2}$. To illustrate what happens
if this is violated, it is useful to denote the dimension of $f_2(\cdot)$ by
$q_2$. First consider the case where $q_2 < p_2$. This implies $\theta_{0,2}$ is
unidentified by $E[f_2(v_t, \theta_0)] = 0$ conditional on $\theta_{0,1}$ and so
$\theta_0$ is unidentified. Now consider the case where $q_2 > p_2$. This time
$\theta_{0,2}$ is over-identified by $E[f_2(v_t, \theta_0)] = 0$ conditional on
$\theta_{0,1}$. Consequently, there is not generally a value of $\theta_2$ which
sets $Q_{2,T}(\theta_1, \theta_2)$ to zero for any value of $\theta_1$. This
means the value of $\theta_1$ which minimizes $Q_T(\theta)$ is no longer the same
as the value which minimizes $Q_{1,T}(\theta_1)$ alone.

In spite of this limitation, many sequential estimators are covered by these 
conditions. The main advantage of this GMM interpretation comes in the
calculation of the correct asymptotic variance for $\hat\theta_{2,T}$. Since
$\hat\theta_{2,T}$ is calculated conditional on $\hat\theta_{1,T}$, its
asymptotic distribution must take account of the uncertainty inherent in the
estimation of $\theta_{0,1}$. The correct distribution is typically
not obvious when the estimator is viewed in its original sequential form. How- 
ever, the GMM perspective allows the correct form of the distribution to be 
deduced immediately from Theorem 3.2. As an illustration, we consider a more 
general version of the partial adjustment model discussed in Section 3.1. Other 
examples can be found in Newey (1984) and Newey and McFadden (1994). 


Example: A Partial Adjustment Model for Inventory Holdings 
Hall and Rossana (1991) consider the following model for inventories 


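$$\Delta y_t = \gamma_{0,0} + \gamma_{1,0}\, y_{t-1} + \gamma_{2,0}'\, x_{t-1} + \gamma_{3,0}\, w^e_{1,t} + \gamma_{4,0}\, w^e_{2,t} + u_t$$
$$u_t = \rho_0 u_{t-1} + e_t$$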


where $y_t$ are inventory holdings in period $t$, $\Delta y_t = y_t - y_{t-1}$,
$x_{t-1}$ is a vector containing the number of workers, the hours per production
worker, materials, work in progress and unfilled orders in period $t-1$,
$w^e_{1,t}$ is the expected new orders, and $w^e_{2,t}$ is expected material
prices. All variables are in logs. The error term $e_t$ is assumed to be
independently and identically distributed with mean zero. If all the regressors
are observed then the parameters can be estimated by nonlinear least squares.
These estimators are defined to be


$$(\hat\gamma_T, \hat\rho_T) = \arg\min_{(\gamma, \rho) \in \Gamma \times R}\; T^{-1} \sum_{t=1}^{T} \{e_t(\gamma, \rho)\}^2 \qquad (3.119)$$


where $e_t(\gamma, \rho) = u_t(\gamma) - \rho u_{t-1}(\gamma)$,
$u_t(\gamma) = \Delta y_t - (1, y_{t-1}, x_{t-1}', w^e_{1,t}, w^e_{2,t})\gamma$ and
$\gamma$ is the $(9 \times 1)$ vector of regression parameters. Unfortunately,
neither of the expected values is known at time $t$. To circumvent this problem,
Hall and Rossana (1991) estimate these variables by their least squares
predictions from the AR(12) models

$$w_{i,t} = \tilde{w}_{i,t}' \beta_{i,0} + e_{i,t}$$


where $\tilde{w}_{i,t} = (1, w_{i,t-1}, w_{i,t-2}, \ldots, w_{i,t-12})'$ and
$\beta_{i,0}$ is the vector of regression parameters. These predictions are
known as “generated regressors” because they are generated from a separate
model. The need to predict $w^e_{i,t}$ creates a sequential estimation.$^{76}$
In the first step $\theta_{0,1} = (\beta_{1,0}', \beta_{2,0}')'$ are estimated.
In the second step, $\theta_{0,2} = (\gamma_0', \rho_0)'$ are estimated
conditional on $\hat\theta_{1,T}$. However, it is not obvious exactly how this
structure would affect inference about the parameters of the inventory
equation. As suggested above, the answer is found by interpreting the estimation
from a GMM perspective. To achieve this, we must derive the population moment
conditions which are being implicitly exploited in each step of the estimation.
Since the univariate AR(12) models are linear regression models, it follows from
the previous subsection$^{77}$ that $\hat\beta_{i,T}$ are the GMM estimators
based on


$$E[f_1(v_t, \theta_{0,1})] = E\begin{bmatrix} \tilde{w}_{1,t}(w_{1,t} - \tilde{w}_{1,t}'\beta_{1,0}) \\ \tilde{w}_{2,t}(w_{2,t} - \tilde{w}_{2,t}'\beta_{2,0}) \end{bmatrix} = 0$$

The minimand for nonlinear least squares estimation also fits within the
framework discussed in the previous sub-section. The minimand in (3.119) can be
obtained from (3.107) by putting $N_t(\theta) = e_t(\gamma, \rho)^2$. Therefore it
follows from (3.109) that the GMM interpretation of Hall and Rossana’s (1991)
estimator is completed by


$$E[f_2(v_t, \theta_0)] = E\left[\frac{\partial e_t(\theta_0)}{\partial \theta_2}\, e_t(\theta_0)\right] = 0$$

where

$$e_t(\theta) = u_t(\gamma, \theta_1) - \rho\, u_{t-1}(\gamma, \theta_1)$$
$$u_t(\gamma, \theta_1) = \Delta y_t - (1, y_{t-1}, x_{t-1}', \tilde{w}_{1,t}'\beta_1, \tilde{w}_{2,t}'\beta_2)\gamma$$


The correct form of the asymptotic distribution of the inventory equations can
be deduced from Theorem 3.2. $\diamond$
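To make the stacking concrete, here is a minimal Python sketch of a generated
regressor estimation written as a single set of moment conditions. It is a
heavily simplified stand-in for the Hall and Rossana (1991) model, and every
simplification is an assumption made for illustration: there is one expected
variable rather than two, the forecasting equation is an AR(1) rather than an
AR(12), the second-step error is serially uncorrelated so that least squares
replaces nonlinear least squares, and the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 400
w = np.zeros(T + 1)
for t in range(1, T + 1):
    w[t] = 0.5 + 0.6 * w[t - 1] + rng.normal()
w_lag, w_cur = w[:-1], w[1:]
y = 1.0 + 2.0 * (0.5 + 0.6 * w_lag) + rng.normal(size=T)

# Step 1: least squares for the forecasting equation w_t = (1, w_{t-1}) beta + error.
Z = np.column_stack([np.ones(T), w_lag])
beta_hat = np.linalg.lstsq(Z, w_cur, rcond=None)[0]

# Step 2: least squares of y_t on (1, predicted w_t), conditional on beta_hat.
X_hat = np.column_stack([np.ones(T), Z @ beta_hat])
gamma_hat = np.linalg.lstsq(X_hat, y, rcond=None)[0]

def stacked_sample_moment(beta, gamma):
    """Sample mean of the stacked moment function (f_1', f_2')'."""
    f1 = Z * (w_cur - Z @ beta)[:, None]          # first-step moments
    X = np.column_stack([np.ones(T), Z @ beta])   # regressors depend on beta
    f2 = X * (y - X @ gamma)[:, None]             # second-step moments
    return np.concatenate([f1.mean(axis=0), f2.mean(axis=0)])

g_hat = stacked_sample_moment(beta_hat, gamma_hat)
```

The stacked sample moment is zero at the sequential estimates, so they solve a
just-identified GMM problem. Because `f2` depends on `beta`, the derivative
matrix of the stacked moments is not block diagonal, which is how the first
step uncertainty enters the variance of the second step estimator.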


3.9 Summary 


This chapter provides a comprehensive treatment of GMM estimation in cor- 
rectly specified models. Building from the discussion in the previous chapter, 
it is shown that the basic approach to estimation employed in the linear static 


76 Pagan (1984) presents an in depth analysis of the problems caused by generated regres- 
sors. However, he does not exploit the GMM perspective described here. This approach was 
taken first by Newey (1984). 

77 Notice that the static nature of the variables in our earlier example played
no role in the discussion.

model translates readily to nonlinear, dynamic models. The basic statistical
framework also translates although, inevitably, the presence of nonlinearity
and dynamics complicates the analysis at various points. Seven key features
emerge.


• Identification: For the estimation to be successful, the population moment
condition must not only be valid but also provide sufficient information 
to identify the parameter vector. The intuition behind parameter iden- 
tification is identical to the linear model, but nonlinearity considerably 
complicates its verification within a particular model. As a result, it is 
necessary to introduce the concepts of local and global identification. 


• Calculation of the estimator: The presence of nonlinearity and, to a lesser
extent, the dynamics means that the first order conditions do not yield 
a closed form solution for the estimator in general. Instead, the solution 
must be found via numerical optimization techniques. 


• Identifying and overidentifying restrictions: GMM estimation in
  overidentified models involves a fundamental decomposition of the population
  moment condition into identifying and overidentifying restrictions. The
  identifying restrictions contain the information that goes into the
  estimation, and the overidentifying restrictions are a remainder that
  manifests itself in the estimated sample moment.


• Asymptotic properties: The GMM estimator is consistent and, when
  appropriately scaled, has a limiting normal distribution. Here too, the
  absence of a closed form solution for the estimator necessitates a different
  approach. This difference is most marked in the proof of consistency.
However, once consistency is established, the Mean Value Theorem can 
be used to linearize the sample moment, and the proof of asymptotic nor- 
mality can be viewed as a direct generalization of the arguments used in 
the linear model. 


• Estimated sample moment: The estimated sample moment is shown to
have a limiting normal distribution whose attributes depend directly on 
the function of the data in the overidentifying restrictions. 


• Long run covariance matrix estimation: To translate the asymptotic normality
  into practical inference procedures, it is necessary to estimate the long run
  variance of the sample moment consistently. To construct a suitable
  estimator, it is necessary to make certain assumptions about the dependence
  structure of $f(v_t, \theta_0)$, the function of the data which appears in the
  population moment condition. Three cases are considered: $f(v_t, \theta_0)$ is
  a serially uncorrelated process; $f(v_t, \theta_0)$ is generated by a vector
  autoregressive moving average process; the class of heteroscedasticity and
  autocorrelation consistent (HAC) covariance matrix estimators whose
  properties only require the dependence structure to satisfy very mild
  restrictions.


• Optimal choice of weighting matrix: The optimal choice of weighting matrix
  converges to the inverse of the long run covariance matrix of the sample
  moment. Therefore, in general, its use necessitates a two step or iterated
  estimation.
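The two step procedure in the last item can be sketched for the linear
instrumental variables case. This is an illustrative outline under stated
assumptions, not the book's own code: the moment function is
$f(v_t, \theta) = z_t(y_t - x_t'\theta)$, it is treated as serially
uncorrelated when estimating the long run variance, and the simulated design
and the function names `gmm_linear` and `two_step_gmm` are hypothetical.

```python
import numpy as np

def gmm_linear(y, X, Z, W):
    """Linear GMM estimator for the moments E[z_t (y_t - x_t' theta)] = 0."""
    T = len(y)
    Szx = Z.T @ X / T
    Szy = Z.T @ y / T
    return np.linalg.solve(Szx.T @ W @ Szx, Szx.T @ W @ Szy)

def two_step_gmm(y, X, Z):
    T = len(y)
    W1 = np.linalg.inv(Z.T @ Z / T)       # first-step weighting matrix
    theta1 = gmm_linear(y, X, Z, W1)
    u = y - X @ theta1                    # first-step residuals
    f = Z * u[:, None]                    # f(v_t, theta1) = z_t u_t
    S_hat = f.T @ f / T                   # assumes f is serially uncorrelated
    theta2 = gmm_linear(y, X, Z, np.linalg.inv(S_hat))
    return theta1, theta2

rng = np.random.default_rng(3)
T = 5000
Z = rng.normal(size=(T, 3))
e = rng.normal(size=T)
x = Z @ np.array([1.0, 0.5, 0.5]) + 0.8 * e + rng.normal(size=T)
u = e * np.sqrt(0.5 + Z[:, 0] ** 2)       # heteroscedastic and correlated with x
y = 2.0 * x + u
theta1, theta2 = two_step_gmm(y, x[:, None], Z)
```

Both estimates should lie close to the value 2 used in the simulation. An
iterated estimator would repeat the residual and weighting matrix update until
the estimates stop changing.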


In addition to the standard GMM estimation framework, this chapter also 
discusses certain important extensions. It is shown that the estimator
and/or subsequent inferences are sensitive to certain transformations of
the data, parameter vector or moment condition. These sensitivities motivate
the discussion of the continuous updating GMM estimator and also an alter- 
native method for the construction of confidence sets based on inverting the 
minimand. 

It is also shown that GMM can be viewed as a unifying principle of estimation 
because it encompasses other methods such as Maximum Likelihood, Ordinary 
Least Squares and certain sequential estimation techniques.

A key assumption throughout is that the model is correctly specified. In the 
next chapter, we consider the consequences of misspecification for the asymp- 
totic properties of the estimator and estimated sample moment. As would be 
anticipated, these consequences are not good and this motivates the use of spec- 
ification tests, such as the overidentifying restrictions test. Such diagnostic tests 
are examined in Chapter 5 as part of a more general review of hypothesis testing 
within the GMM framework. 


4


GMM Estimation in 
Misspecified Models 


The previous chapter establishes the large sample properties of the estimator 
and its various associated statistics in correctly specified models. In practice, 
a researcher never knows whether his/her assumptions correspond to the real 
world, and so it is important to consider the impact of misspecification on the 
statistical properties derived in Chapter 3. Intuition suggests misspecification 
has a detrimental effect, and this is borne out by the analysis presented in 
this chapter.$^{1}$ In particular, it is shown that misspecification contaminates
inferences about the parameter vector, and this pessimistic conclusion motivates 
the model specification tests presented in the next chapter. However, there 
is a secondary purpose to the presentation of a formal analysis of the GMM 
estimator under misspecification. Inspection of the empirical literature reveals 
that it is not uncommon to find cases in which the sample evidence suggests that 
the model is misspecified but inference about the parameters is still performed 
— either implicitly or explicitly — using the asymptotic theory appropriate for 
correctly specified models. The results presented here provide guidance on the 
interpretation of such inferences, and suggest that this approach to inference in 
misspecified models is invalid in general. 

Before we proceed further, it is useful to consider exactly what is meant 
by the term “misspecification” in our context. As seen in Chapter 1, an
economic/statistical model consists of a set of assumptions about the data
generation process for $v_t$. For expositional convenience, we now denote this
model by $M$. This model implies a set of population moment conditions which
can be used as a basis for GMM estimation of $\theta_0$. This logical sequence
can be represented by


$$M \implies E[f(v_t, \theta_0)] = 0, \ \forall t, \ \text{for some unique } \theta_0 \in \Theta \qquad (4.1)$$


1 Also see Section 2.5 for a heuristic discussion of the consequences of
misspecification in the static linear model.

If $M$ is no longer considered to be the truth, then there are two natural,
alternative scenarios. First, the true model, $M_A$, although different from
$M$, shares the property in (4.1); that is


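$$M_A \implies E[f(v_t, \theta_+)] = 0, \ \forall t, \ \text{for some unique } \theta_+ \in \Theta \qquad (4.2)$$


Secondly, the true model, $M_B$, implies the property in (4.1) does not hold;
that is


$$M_B \implies \nexists\, \theta \in \Theta \ \text{such that } E[f(v_t, \theta)] = 0, \ \forall t. \qquad (4.3)$$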


Clearly, $M$ and $M_A$ are observationally equivalent on the basis of
$E[f(v_t, \theta)]$ alone.$^{2}$ Therefore, the estimator and the estimated
sample moment have essentially the same large sample properties under $M$ and
$M_A$; the only difference is in the use of $\theta_0$ or $\theta_+$ to denote
the value at which the population moment condition and other regularity
conditions are satisfied. In contrast, $M$ and $M_B$ have different implications
for $E[f(v_t, \theta)]$, and these manifest themselves in the behaviour of the
estimator and the estimated sample moment. It is convenient to reserve the term
“misspecification” to denote only this second situation. As it stands, (4.3)
states only that $E[f(v_t, \theta)]$ is non-zero. For the analysis in this
chapter, it is most convenient to retain the assumption that $v_t$ is a
stationary process, and so $E[f(v_t, \theta)]$ is independent of $t$. Therefore,
we restrict attention to the following class of misspecified models.


Assumption 4.1 The Nature of the Misspecification
$E[f(v_t, \theta)] = \mu(\theta)$ for all $t$ and $\|\mu(\theta)\| > 0$ for all $\theta \in \Theta$.$^{3}$


One immediate consequence of this assumption is that it excludes
misspecification characterized by structural instability, that is, cases in
which $E[f(v_t, \theta)] = \mu_t$. While this obviously limits the generality of
the analysis, the price is worth paying because Assumption 4.1 smooths the
passage from correctly to incorrectly specified models, and so enables us to
highlight more simply the main differences between the two scenarios. However,
in Section 5.4, we do return to the topic of structural instability in the
context of hypothesis testing. There is one further consequence of Assumption
4.1 which should be noted. Taken together, Assumptions 4.1 and 3.1 (the
stationarity of $v_t$) imply that $q > p$. This follows because if $p = q$ then
the value which satisfies the identifying restrictions in (3.19), $\theta_+$
say, must also satisfy the population moment condition.$^{4}$ In other words, if
the parameter vector is just-identified then the true model must exhibit the
properties of $M_A$ above.$^{5}$


2 Notice this definition of observational equivalence depends crucially on
$f(\cdot)$. Since $M$ and $M_A$ are different models they will have different
implications for other aspects of the distribution of $v_t$.

3 For any vector $a$, $\|a\| = (a'a)^{1/2}$.

4 See Hall and Inoue (2003).

5 This does not hold if $v_t$ is non-stationary.

In practice, inference is typically based on the two step or iterated
estimator. The key feature of such estimators is that the $i$th step estimation
employs a weighting matrix equal to the inverse of a covariance matrix
estimator calculated using $\hat\theta_T(i-1)$. This structure means that the
population analog to the minimand on the $i$th step depends on the probability
limit of $\hat\theta_T(i-1)$ via the weighting matrix. This construction
provides a mechanism
through which the consequences of misspecification are transmitted from one 
step to the next. This means that to deduce the impact of this misspecifica- 
tion on the iterated estimator, it is necessary to consider each step sequentially. 
Therefore, we begin our discussion with the first step estimator: Section 4.1 
derives its probability limit, and Section 4.2 derives its limiting distribution. 
It emerges that misspecification considerably complicates the analysis of the 
limiting distribution. Specifically, the rate of convergence of
$\hat\theta_T(1)$ to $\theta_*(1)$ depends on the rate of convergence of $W_T$
to $W$. This means that in some cases $T^{1/2}[\hat\theta_T(1) - \theta_*(1)]$
does not converge in distribution. However, it is shown
that this statistic has a limiting normal distribution under certain conditions 
which plausibly cover the most common choices of weighting matrix in practice. 
Section 4.3 considers the impact of misspecification on the long run covariance 
matrix estimators presented in Section 3.5. It is shown that none of these es- 
timators are consistent if the model is misspecified. However, it is also shown 
that there is a simple way to modify all the estimators to ensure consistency 
regardless of whether or not the model is correctly specified. There are cer- 
tain advantages to using one of these modified estimators in the construction 
of moment selection procedures. A formal justification of this statement is left 
until Chapter 7. Section 4.4 examines the limiting behaviour of the second- 
step estimator. Here too, the method of covariance matrix estimation is im- 
portant because it determines the rate of convergence of the weighting matrix 
to its limit and hence the rate of convergence of the estimator. We concentrate
on two cases. Section 4.4.1 presents the analysis when the covariance matrix
estimator is constructed under the assumption that $f(v_t, \theta_*)$ is a
serially uncorrelated process. Section 4.4.2 presents the same analysis when an
HAC estimator is used. Section 4.5 considers the limiting behaviour of the
estimated sample moment. Unlike the estimator, $T^{1/2} g_T(\hat\theta_T(i))$
diverges at rate $T^{1/2}$ regardless of the rate of convergence of the
weighting matrix. Finally, Section
4.6 provides a summary of the consequences of misspecification on the GMM 
estimator. 

Before we begin the analysis, it is necessary to address an item of notation.
In the course of our discussion, it emerges that
$\mathrm{plim}_{T \to \infty}\, \hat\theta_T(i)$ may be different for each $i$,
and consequently, we use $\theta_*(i)$ to denote this limit. However, to avoid
excessive repetition, we express assumptions in terms of $\theta_*$ and then
define $\theta_*$ in the appropriate theorem. In spite of the aforementioned
dependence on $i$, there are times in which the analysis is generic to all steps
and so we adopt the more economical notation of $\hat\theta_T$ for the estimator
and $\theta_*$ for its probability limit.




4.1 Probability Limit of the First Step 
Estimator 


By definition, the first step GMM estimator can be constructed with any
weighting matrix which satisfies Assumption 3.7. In Section 3.4.1, it is shown
that such an estimator converges in probability to $\theta_0$ in correctly
specified models provided certain regularity conditions are satisfied. So this
earlier analysis provides the natural place to start our search for conditions
under which the first step estimator converges in misspecified models. It can
be recalled that the proof of Theorem 3.1 is broken down into two parts. Part
(i) uses the uniform convergence property in Lemma 3.1 to establish that
$\hat\theta_T$ minimizes $Q_0(\theta)$ with probability one as $T \to \infty$.
Then part (ii) uses the population moment and identification conditions in
Assumptions 3.3-3.4 to show that part (i) implies consistency. This overview
suggests similar arguments can be used to establish the convergence of
$\hat\theta_T$ in misspecified models provided a suitable replacement is found
for Assumptions 3.3 and 3.4 in part (ii). To this end, we now introduce the
following assumption.


Assumption 4.2 Identification Condition
There exists $\theta_* \in \Theta$ such that $Q_0(\theta_*) < Q_0(\theta)$ for all $\theta \in \Theta \setminus \{\theta_*\}$.


Assumption 4.2 states that the population analog to the first step GMM
minimand has a unique minimum at $\theta_*$. This property defines
$\theta_* = \theta_*(1)$ as the probability limit of $\hat\theta_T(1)$ in
Theorem 4.1 below. Before we present that result, it is worth noting two ways
in which Assumption 4.2 differs from the combination of Assumptions 3.3 and
3.4. First, Assumption 4.2 does not imply a specific value for
$E[f(v_t, \theta_*)]$, although it does imply that
$\|E[f(v_t, \theta_*)]\| < \infty$. Secondly, in misspecified models, there is no
reason why the same parameter value should minimize $Q_0(\theta)$ for two
different choices of $W$. Therefore, in general, $\theta_*$ is determined in
part by $W$.$^{6}$
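The dependence of $\theta_*$ on $W$ is easy to see in a small population
calculation. The sketch below takes $Q_0(\theta) = \mu(\theta)' W \mu(\theta)$,
the population quadratic form used in Chapter 3 (an assumption here, since that
definition lies outside this passage), with a linear
$\mu(\theta) = m - A\theta$. The numbers are invented: two moments impose a
common mean on variables whose true means are 1 and 2, so no $\theta$ satisfies
both.

```python
import numpy as np

def pseudo_true_value(m, A, W):
    """Minimizer of Q0(theta) = mu(theta)' W mu(theta), mu(theta) = m - A @ theta."""
    theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ m)
    mu = m - A @ theta
    return theta, mu, float(mu @ W @ mu)

m = np.array([1.0, 2.0])        # E[x_t] = 1 and E[y_t] = 2
A = np.array([[1.0], [1.0]])    # both moments impose the same mean theta

theta_I, mu_I, Q_I = pseudo_true_value(m, A, np.eye(2))
theta_D, mu_D, Q_D = pseudo_true_value(m, A, np.diag([3.0, 1.0]))
```

With the identity weighting matrix the minimizer is 1.5, while weighting the
first moment three times as heavily moves it to 1.25. In both cases the
minimized value of $Q_0$ is strictly positive, as Assumption 4.1 requires.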


Theorem 4.1 Convergence of $\hat\theta_T(1)$
If Assumptions 3.1-3.2, 3.7-3.10, 4.1 hold and 4.2 holds for $\theta_* = \theta_*(1)$ then
$\hat\theta_T(1) \xrightarrow{p} \theta_*(1)$.


As anticipated above, the proof is split into two parts along similar lines 
to the proof of Theorem 3.1. Part (i) uses the definition of the estimator and 
Lemma 3.1 to deduce that 


$$\lim_{T \to \infty} P[\,0 \le Q_0(\hat\theta_T(1)) < Q_0(\theta_*(1)) + \epsilon\,] = 1 \ \text{ for any } \epsilon > 0 \qquad (4.4)$$

Part (ii) uses (4.4) and the identification condition in Assumption 4.2 to
deduce that $\hat\theta_T(1) \xrightarrow{p} \theta_*(1)$. The details are left
to the reader.
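Theorem 4.1 can be illustrated by simulation. The following sketch is a toy
design chosen for illustration, not taken from the text: a common mean is
estimated from $f(v_t, \theta) = (x_t - \theta, y_t - \theta)'$ when the true
means are 1 and 2, using a fixed weighting matrix $W = \mathrm{diag}(3, 1)$.
Minimizing the population quadratic form for this design gives the limit 1.25.
The function name `first_step_estimate` is hypothetical.

```python
import numpy as np

def first_step_estimate(x, y, W):
    """GMM estimate of a common mean from f = (x_t - theta, y_t - theta)'."""
    m = np.array([x.mean(), y.mean()])
    iota = np.ones(2)
    return (iota @ W @ m) / (iota @ W @ iota)

rng = np.random.default_rng(4)
W = np.diag([3.0, 1.0])
theta_star = 1.25   # population minimizer for E[x_t] = 1, E[y_t] = 2 and this W
errors = []
for T in (100, 10_000, 1_000_000):
    x = rng.normal(1.0, 1.0, T)
    y = rng.normal(2.0, 1.0, T)
    errors.append(abs(first_step_estimate(x, y, W) - theta_star))
```

The entries of `errors` record the distance between the estimate and 1.25 at
each sample size. The estimator settles on a well defined value even though no
parameter value satisfies both moment conditions.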


6 See Section 4.4 for further discussion of this issue in the context of the
two step or iterated estimator.

4.2 Asymptotic Distribution Theory for the
First Step Estimator


In Section 3.4.2 it is shown that $T^{1/2}(\hat\theta_T - \theta_0)$ converges
to a normal distribution if the model is correctly specified. In this section,
we develop an analogous limiting distribution theory for the first step
estimator when the model is misspecified. It emerges that the weighting matrix
plays a far more fundamental role in misspecified models, and this complicates
the analysis. This dependence is present at each step of the GMM estimation,
and so the first part of the analysis is not specific to the first step
estimator. Therefore, we adopt the generic notation of $\hat\theta_T$ for the
estimator and $\theta_*$ for its probability limit for most of this analysis
and then specialize the results to $\hat\theta_T(1)$ at the end. This section is
based on results in Hall and Inoue (2003) to which the reader is referred for
rigorous proofs of the main results.$^{7}$

As in the previous section, we need to determine appropriate conditions under
which to perform the analysis. Once again, the logical starting place is the
corresponding analysis in correctly specified models. Inspection of the
regularity conditions in Theorem 3.2 reveals that many of them do not involve
the specification of the model per se. In particular, Assumptions 3.1, 3.2,
3.7-3.10, 3.12-3.13 impose regularity conditions on $v_t$, $\Theta$ or the
behaviour of $f(\cdot)$, $\partial f(\cdot)/\partial\theta'$ over $\Theta$.
Therefore, we can equally well impose those assumptions here. Obviously,
Assumptions 3.3-3.4 depend on the model specification, and, as in the previous
section, we replace them with Assumption 4.2. Once this is done, we can invoke
Theorem 4.1 to deduce that $\hat\theta_T \xrightarrow{p} \theta_*$. The nature
of this limit will have an impact on our analysis. It can be recalled from
Section 3.4.2 that the analysis started with the Mean Value Theorem applied to
$g_T(\hat\theta_T)$ around $\theta_0$. We use a similar starting point below but
take the linearization around $\theta_*$. So we must replace Assumptions 3.5,
3.12 and 3.13 by the following assumption.$^{8}$


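Assumption 4.3 Regularity Conditions on $\partial f(v_t, \theta)/\partial\theta'$

(i) The derivative matrix $\partial f(v, \theta)/\partial\theta'$ exists and is
continuous on $\Theta$ for each $v \in V$; (ii) $\theta_*$ is an interior point
of $\Theta$; (iii) $E[\partial f(v_t, \theta_*)/\partial\theta']$ exists and is
finite; (iv) $E[\partial f(v_t, \theta)/\partial\theta']$ is continuous on some
$\epsilon$-neighbourhood $N_\epsilon$ of $\theta_*$;
(v) $\sup_{\theta \in N_\epsilon} \|G_T(\theta) - E[\partial f(v_t, \theta)/\partial\theta']\| \xrightarrow{p} 0$.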


Once the linearization is taken around $\theta_*$, it is the behaviour of
$T^{1/2} g_T(\theta_*)$ which becomes relevant. Accordingly, we define


$$E[f(v_t, \theta_*)] = \mu_* \qquad (4.5)$$


Notice that Assumption 4.1 implies $\mu_* \neq 0$. We must also replace
Assumption 3.11 and Lemma 3.2 by:


7 Hall and Inoue’s (2003) results subsume earlier work by Maasoumi and Phillips
(1982); the latter paper presents the limiting distribution of the IV estimator
in the linear regression model with $W_T$ set equal to the inverse of the
instrument cross product matrix.

8 Part (i) is identical to Assumption 3.5(i). It is repeated here to simplify
the presentation.

Assumption 4.4 Properties of the Variance of the Sample Moment


(i) $E[(f(v_t, \theta_*) - \mu_*)(f(v_t, \theta_*) - \mu_*)']$ exists and is finite;
(ii) $\lim_{T \to \infty} Var[T^{1/2} g_T(\theta_*)] = S_*$ exists and is a finite
valued positive definite matrix.


Lemma 4.1 Central Limit Theorem for $T^{-1/2} \sum_{t=1}^{T} [f(v_t, \theta_*) - \mu_*]$
If Assumptions 3.1, 3.8, 4.1, and 4.4 hold then
$T^{-1/2} \sum_{t=1}^{T} [f(v_t, \theta_*) - \mu_*] \xrightarrow{d} N(0, S_*)$.


With all these assumptions imposed, we can now proceed to the analysis. As 
mentioned above, we begin by using the Mean Value Theorem to deduce that 


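$$g_T(\hat\theta_T) = g_T(\theta_*) + G_T(\hat\theta_T, \theta_*, \lambda_T)(\hat\theta_T - \theta_*) \qquad (4.6)$$


where $G_T(\hat\theta_T, \theta_*, \lambda_T)$ is the $(q \times p)$ matrix whose
$i$th row is equal to the $i$th row of $G_T(\bar\theta_T^{(i)})$, where
$\bar\theta_T^{(i)} = \lambda_{T,i}\theta_* + (1 - \lambda_{T,i})\hat\theta_T$ for
some $0 \le \lambda_{T,i} \le 1$, and $i = 1, 2, \ldots, q$. It is then possible to
apply the same sequence of arguments as in Section 3.4.2 to show that (4.6)
leads to


$$T^{1/2}(\hat\theta_T - \theta_*) = -[G_T(\hat\theta_T)' W_T G_T(\hat\theta_T, \theta_*, \lambda_T)]^{-1} G_T(\hat\theta_T)' W_T T^{1/2} g_T(\theta_*) \qquad (4.7)$$

It is convenient to rewrite (4.7) as

$$T^{1/2}(\hat\theta_T - \theta_*) = H_{0,T}\{H_{1,T} + H_{2,T}\} \qquad (4.8)$$

where

$$H_{0,T} = -[G_T(\hat\theta_T)' W_T G_T(\hat\theta_T, \theta_*, \lambda_T)]^{-1} \qquad (4.9)$$
$$H_{1,T} = G_T(\hat\theta_T)' W_T T^{-1/2} \sum_{t=1}^{T} [f(v_t, \theta_*) - \mu_*] \qquad (4.10)$$
$$H_{2,T} = T^{1/2} G_T(\hat\theta_T)' W_T \mu_* \qquad (4.11)$$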


It is instructive to compare (4.8) with the corresponding equation in our
analysis of correctly specified models, (3.26). The term $H_{0,T}H_{1,T}$ can be
recognized as the analog to the right hand side of (3.26), and so
misspecification has introduced a second term, $H_{0,T}H_{2,T}$, into the
equation.$^{9}$ To proceed further, it is useful to decompose $H_{2,T}$ as
follows:


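$$H_{2,T} = H_{2,T}(1) + H_{2,T}(2) + H_{2,T}(3) + H_{2,T}(4) \qquad (4.12)$$

where

$$H_{2,T}(1) = T^{1/2}[G_T(\hat\theta_T) - G_T(\theta_*)]' W_T \mu_* \qquad (4.13)$$
$$H_{2,T}(2) = T^{1/2}[G_T(\theta_*) - G_*]' W_T \mu_* \qquad (4.14)$$
$$H_{2,T}(3) = G_*' T^{1/2}(W_T - W) \mu_* \qquad (4.15)$$
$$H_{2,T}(4) = T^{1/2} G_*' W \mu_* \qquad (4.16)$$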

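9 Notice that if the model is correctly specified then $\mu_* = 0$ and so
$H_{2,T} = 0$.

At this stage, we can take advantage of two simplifications. First, the
population analog to the first order conditions implies $H_{2,T}(4) = 0$.
Secondly, $H_{2,T}(1)$ can be written as$^{10}$


$$\begin{aligned}
H_{2,T}(1) &= (\mu_*' W_T \otimes I_p)\, \mathrm{vec}\{T^{1/2}[G_T(\hat\theta_T) - G_T(\theta_*)]\} \\
           &= (\mu_*' W_T \otimes I_p)\, G_T^{(2)}(\hat\theta_T, \theta_*, \phi_T)\, T^{1/2}(\hat\theta_T - \theta_*) \\
           &= M_T\, T^{1/2}(\hat\theta_T - \theta_*), \ \text{say}
\end{aligned}$$


where $G_T^{(2)}(\hat\theta_T, \theta_*, \phi_T)$ is the $pq \times p$ matrix
whose $i$th row is the corresponding row of $G_T^{(2)}(\bar\theta_T^{(i)})$, the
sample average of
$(\partial/\partial\theta')\mathrm{vec}\{\partial f(v_t, \theta)/\partial\theta'\}$
evaluated at $\theta = \bar\theta_T^{(i)}$, with
$\bar\theta_T^{(i)} = \phi_{T,i}\hat\theta_T + (1 - \phi_{T,i})\theta_*$,
$0 \le \phi_{T,i} \le 1$, and $\phi_T$ is the $pq \times 1$ vector with $i$th
element $\phi_{T,i}$.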

Taking advantage of these two simplifications, (4.8)-(4.16) can be used to
deduce that


$$T^{1/2}(\hat\theta_T - \theta_*) = [I_p - H_{0,T}M_T]^{-1} H_{0,T}\{H_{1,T} + H_{2,T}(2) + H_{2,T}(3)\} \qquad (4.17)$$


Intuition suggests that $[I_p - H_{0,T}M_T]^{-1}H_{0,T}$ converges in probability
to some matrix of constants and $H_{1,T}$ converges in distribution to a normal
vector under our conditions. It is also reasonable to assume that
$T^{1/2}[G_T(\theta_*) - G_*]$ converges to a normal limiting distribution under
certain conditions, and so $H_{2,T}(2)$ exhibits the same property. The key
question is the limiting behaviour of $H_{2,T}(3)$. From (4.15), it is clear that
the limiting behaviour of $H_{2,T}(3)$ depends on that of $T^{1/2}(W_T - W)$. In
order for $T^{1/2}(\hat\theta_T - \theta_*)$ to converge in distribution, it is a
necessary condition that $T^{1/2}(\hat\theta_T - \theta_*) = O_p(1)$. From (4.17),
it is clear that such a condition can only be satisfied if
$T^{1/2}(W_T - W) = O_p(1)$. Therefore, if $W_T$ converges to $W$ at a slower rate
than $T^{1/2}$ then $T^{1/2}(\hat\theta_T - \theta_*)$ must diverge. This
dependence of $T^{1/2}(\hat\theta_T - \theta_*)$ on $T^{1/2}(W_T - W)$ is in marked
contrast to what is found in correctly specified models, and is directly
attributable to the presence of $H_{2,T}$ in (4.8).

To make further progress, it is clearly necessary to make some assumption
about the nature of the convergence of $W_T$ to $W$. We focus on two particular
scenarios which both satisfy $T^{1/2}(W_T - W) = O_p(1)$ and together cover the
choices of first step estimator used in our empirical example in Chapter 3. The
first scenario is where $W_T = W$, which obviously covers $W_T = I_q$, and the
second is where $T^{1/2}(W_T - W)\mu_*$ converges to a normal distribution, which
we show below covers $W_T = [T^{-1}\sum_{t=1}^{T} z_t z_t']^{-1}$ under plausible
assumptions. However, before we present these results certain other conditions
must be imposed. To ensure that $G_T^{(2)}(\hat\theta_T, \theta_*, \phi_T)$
converges to a well-defined limit, we impose:


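Assumption 4.5 Regularity Conditions for $G_T^{(2)}(\theta)$
(i) $(\partial/\partial\theta')\mathrm{vec}\{\partial f(v, \theta)/\partial\theta'\}$ exists and is continuous on $\Theta$ for each $v \in V$;
(ii) $E[(\partial/\partial\theta')\mathrm{vec}\{\partial f(v_t, \theta)/\partial\theta'\}]$ exists and is continuous on $\Theta$;
(iii) $\sup_{\theta \in N_\epsilon} \|G_T^{(2)}(\theta) - E[(\partial/\partial\theta')\mathrm{vec}\{\partial f(v_t, \theta)/\partial\theta'\}]\| \xrightarrow{p} 0$ where $N_\epsilon$ is an $\epsilon$-neighbourhood of $\theta_*$.


10 Dhrymes (1984) [Corollary 25, p.103] and the Mean Value Theorem applied to
the $(i,j)$th element of $G_T(\hat\theta_T)$.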


It is also necessary to ensure that the inverse matrix in (4.17) is well defined
in the limit. Therefore, we impose:


Assumption 4.6 Regularity Conditions on $H_*$

The $p \times p$ matrix $H_* = G_*' W G_* + (\mu_*' W \otimes I_p) G_*^{(2)}$ is
nonsingular where $G_* = E[\partial f(v_t, \theta_*)/\partial\theta']$ and
$G_*^{(2)} = E[(\partial/\partial\theta')\mathrm{vec}\{\partial f(v_t, \theta_*)/\partial\theta'\}]$.

Assumption 4.6 is satisfied if $Q_0(\theta)$ satisfies the second-order
sufficient condition for minimization at $\theta_*$. It is also necessary to
impose certain conditions in order for $H_{2,T}(2)$ to converge to a normal
distribution. For ease of exposition, we impose those conditions implicitly in
the statement of the following theorem.


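Theorem 4.2 Limiting Distribution of the First Step Estimator
Let Assumptions 3.1, 3.2, 3.7-3.10, 4.1-4.6 (with $\theta_* = \theta_*(1)$) hold.


(i) If $W_T = W$ and

$$\begin{bmatrix} T^{-1/2}\sum_{t=1}^{T}[f(v_t, \theta_*) - \mu_*] \\ T^{1/2}[G_T(\theta_*) - G_*]' W \mu_* \end{bmatrix} \xrightarrow{d} N\left(0, \begin{bmatrix} S_* & V_{1,2} \\ V_{2,1} & V_{2,2} \end{bmatrix}\right),$$

then it follows that

$$T^{1/2}(\hat\theta_T - \theta_*) \xrightarrow{d} N(0, \Sigma_1)$$

where

$$\Sigma_1 = H_*^{-1}(G_*' W S_* W G_* + G_*' W V_{1,2} + V_{2,1} W G_* + V_{2,2}) H_*^{-1}$$


(ii) If

$$\begin{bmatrix} T^{-1/2}\sum_{t=1}^{T}[f(v_t, \theta_*) - \mu_*] \\ T^{1/2}[G_T(\theta_*) - G_*]' W \mu_* \\ T^{1/2}(W_T - W)\mu_* \end{bmatrix} \xrightarrow{d} N\left(0, \begin{bmatrix} S_* & V_{1,2} & V_{1,3} \\ V_{2,1} & V_{2,2} & V_{2,3} \\ V_{3,1} & V_{3,2} & V_{3,3} \end{bmatrix}\right)$$

then

$$T^{1/2}(\hat\theta_T - \theta_*) \xrightarrow{d} N(0, H_*^{-1}\Sigma_2 H_*^{-1}),$$

where

$$\begin{aligned}
\Sigma_2 = {} & G_*' W S_* W G_* + V_{2,2} + G_*' V_{3,3} G_* + G_*' W V_{1,2} \\
              & + G_*' W V_{1,3} G_* + V_{2,1} W G_* + G_*' V_{3,1} W G_* + V_{2,3} G_* + G_*' V_{3,2}
\end{aligned}$$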

It is interesting to compare the results in parts (i)-(ii). First recall that
$\theta_*(1)$ depends on $W$. Secondly, the structure of covariance matrices is
different. So, in general, the limiting distributions in (i) and (ii) are
different. However, there is one obvious exception: if $\mu_* = 0$ (i.e. the
model is correctly specified) then $\theta_*(1) = \theta_0$ and both variances
reduce to$^{11}$


$$(G_*' W G_*)^{-1}(G_*' W S_* W G_*)(G_*' W G_*)^{-1} \qquad (4.18)$$


which can be recognized as the variance in Theorem 3.2 (if we put
$\theta_* = \theta_0$). This comment also implies that, in general, the first
step estimator has a different distribution in correctly specified and
misspecified models.
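The structure of the covariance matrix in Theorem 4.2(ii) can be checked
against a direct calculation. The middle matrix is the variance of
$G_*' W z_1 + z_2 + G_*' z_3$, where $(z_1', z_2', z_3')'$ denotes the jointly
normal limit in part (ii). The sketch below verifies this with an arbitrary
covariance matrix and then confirms the reduction to (4.18) when every $V$
block is zero. The function name `sigma_2` and all numerical inputs are
illustrative assumptions.

```python
import numpy as np

def sigma_2(G, W, S, V12, V13, V22, V23, V33):
    """Middle matrix of Theorem 4.2(ii); V21, V31, V32 are the transposes."""
    V21, V31, V32 = V12.T, V13.T, V23.T
    return (G.T @ W @ S @ W @ G + V22 + G.T @ V33 @ G
            + G.T @ W @ V12 + G.T @ W @ V13 @ G + V21 @ W @ G
            + G.T @ V31 @ W @ G + V23 @ G + G.T @ V32)

rng = np.random.default_rng(6)
q, p = 3, 2
G = rng.normal(size=(q, p))
W = np.eye(q) + 0.2 * np.ones((q, q))          # symmetric positive definite

# An arbitrary joint covariance matrix for (z1, z2, z3), partitioned conformably.
n = 2 * q + p
C = rng.normal(size=(n, n))
V = C @ C.T
S, V12, V13 = V[:q, :q], V[:q, q:q + p], V[:q, q + p:]
V22, V23 = V[q:q + p, q:q + p], V[q:q + p, q + p:]
V33 = V[q + p:, q + p:]

L = np.hstack([G.T @ W, np.eye(p), G.T])       # G'W z1 + z2 + G' z3 = L @ z
direct = L @ V @ L.T
formula = sigma_2(G, W, S, V12, V13, V22, V23, V33)

# Correct specification: mu_* = 0, so every V block vanishes and H_* = G'WG.
Z = np.zeros
Sigma0 = sigma_2(G, W, S, Z((q, p)), Z((q, q)), Z((p, p)), Z((p, q)), Z((q, q)))
H = G.T @ W @ G
avar0 = np.linalg.inv(H) @ Sigma0 @ np.linalg.inv(H)
sandwich = np.linalg.inv(H) @ (G.T @ W @ S @ W @ G) @ np.linalg.inv(H)
```

Setting only the blocks involving $z_3$ to zero recovers the middle matrix of
part (i) in the same way.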

It is remarked above that Theorem 4.2 covers the case in which the weighting
matrix is the inverse of the instrument cross product matrix under plausible
assumptions. To uncover the nature of these conditions, we let
$W_T = (T^{-1}\sum_{t=1}^{T} z_tz_t')^{-1}$, $W = M_{zz}^{-1} = \{E[z_tz_t']\}^{-1}$, and rewrite
$T^{1/2}(W_T - W)$ as follows

$$ T^{1/2}(W_T - W) = -M_{zz}^{-1}\,T^{-1/2}\sum_{t=1}^{T}(z_tz_t' - M_{zz})\left(T^{-1}\sum_{t=1}^{T} z_tz_t'\right)^{-1} \tag{4.19} $$

From (4.19), it can be seen that this case is covered by Theorem 4.2(ii) provided
that $\mathrm{vech}\{T^{-1/2}\sum_{t=1}^{T}[z_tz_t' - M_{zz}]\}$ converges to a mean zero normal
distribution.¹²
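The algebra behind (4.19) is the matrix identity $A^{-1} - B^{-1} = -B^{-1}(A - B)A^{-1}$. As a minimal numerical sketch, the following checks that both sides of (4.19) coincide for simulated instruments. The function name, the sample size and the choice $M_{zz} = I_3$ are illustrative assumptions, not part of the text.

```python
import numpy as np

def weighting_matrix_gap(z, M_zz):
    """Return both sides of (4.19) for instruments z (T x q) and a given M_zz."""
    T = z.shape[0]
    M_hat = z.T @ z / T                                  # T^{-1} sum_t z_t z_t'
    lhs = np.sqrt(T) * (np.linalg.inv(M_hat) - np.linalg.inv(M_zz))
    # T^{-1/2} sum_t (z_t z_t' - M_zz) equals sqrt(T) * (M_hat - M_zz)
    rhs = -np.linalg.inv(M_zz) @ (np.sqrt(T) * (M_hat - M_zz)) @ np.linalg.inv(M_hat)
    return lhs, rhs

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 3))        # simulated instruments with E[z z'] = I_3 (assumed)
lhs, rhs = weighting_matrix_gap(z, np.eye(3))
print(np.allclose(lhs, rhs))
```

The identity is exact for any sample, so the comparison holds up to floating point error; the distributional condition in the text concerns only the middle factor.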


4.3 Long Run Covariance Matrix Estimation 


Section 3.5 described various estimators of the long run variance of the sample 
moment. These estimators were grouped into three classes according to the 
assumption made about the dynamic structure of $f(v_t, \theta_0)$. However, all the
estimators have one feature in common: they are constructed under the as- 
sumption that the model is correctly specified. Once we move into the world 
of misspecified models, none of the proposed estimators are consistent even if 
they are based on a correct assumption about the dynamic structure. This 
section describes the impact of misspecification on each of the covariance ma- 
trix estimators, and explains how they can be modified to ensure consistency in 
misspecified models. Gallant and White (1988)[Chapter 6] consider the impact 
of model misspecification on covariance matrix estimation under very general 
conditions, and some of our discussion represents a specialization of their results 
to stationary processes. 

It is shown below that the exact impact of misspecification on each covariance 
matrix estimator is different. However, it is possible to gain a sense of both the 
problem and the solution by examining a single autocovariance matrix. By 
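definition, the $j^{th}$ autocovariance matrix of $f(v_t, \theta_*)$ is

$$
\begin{aligned}
\Gamma_j &= E[\{f(v_t,\theta_*) - \mu_*\}\{f(v_{t-j},\theta_*) - \mu_*\}'] \\
         &= E[f(v_t,\theta_*)f(v_{t-j},\theta_*)'] - \mu_*\mu_*'
\end{aligned}
\tag{4.20}
$$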
11 Notice that $\mu_* = 0$ implies $V_{i,j} = 0$.


12 vech{.} denotes the operator which stacks the lower triangular elements of a matrix into
a vector.

Suppose we estimate $\Gamma_j$ by

$$ \hat\Gamma_j = T^{-1}\sum_{t=j+1}^{T} f(v_t,\hat\theta_T)f(v_{t-j},\hat\theta_T)' \tag{4.21} $$

This statistic is a consistent estimator of the first term on the right-hand side
of (4.20) and is therefore an inconsistent estimator of $\Gamma_j$ because $\mu_* \neq 0$.
Given (4.20), the obvious solution is to estimate $\Gamma_j$ by

$$ \tilde\Gamma_j = T^{-1}\sum_{t=j+1}^{T}[f(v_t,\hat\theta_T) - g_T(\hat\theta_T)][f(v_{t-j},\hat\theta_T) - g_T(\hat\theta_T)]' \tag{4.22} $$

As would be anticipated, this estimator is consistent for $\Gamma_j$. Also notice that
if the model is correctly specified then $\tilde\Gamma_j = \hat\Gamma_j + o_p(1)$, and so there is
no cost asymptotically to using the mean correction when it is unnecessary.

It is useful to introduce a terminology to capture the difference between
$\hat\Gamma_j$ and $\tilde\Gamma_j$. The key difference between them is that the data,
$\{f(v_t,\hat\theta_T)\}$, are “centred” about their mean $g_T(\hat\theta_T)$ in $\tilde\Gamma_j$ but they
are “uncentred” in $\hat\Gamma_j$. Therefore, we refer to $\tilde\Gamma_j$ as the centred version
of the sample autocovariance and $\hat\Gamma_j$ as the uncentred version. These
adjectives are similarly used to distinguish covariance matrices based on
uncentred or centred autocovariances. We now examine the behaviour of each of
the covariance matrix estimators from Section 3.5 in turn.
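The distinction between (4.21) and (4.22) can be sketched numerically. The code below is only an illustration under assumed inputs: the moment series is simulated directly as serially uncorrelated noise around a hypothetical nonzero mean `mu`, rather than being computed from an estimated model, and the function name is ours.

```python
import numpy as np

def sample_autocov(f, j, centred):
    """j-th sample autocovariance of the T x q series f, as in (4.21) or (4.22)."""
    T = f.shape[0]
    if centred:
        f = f - f.mean(axis=0)           # subtract the sample moment g_T
    return f[j:].T @ f[:T - j] / T       # T^{-1} sum_{t=j+1}^T f_t f_{t-j}'

rng = np.random.default_rng(1)
mu = np.array([0.5, -0.3])               # hypothetical nonzero mu_*
f = mu + rng.normal(size=(20000, 2))     # serially uncorrelated, variance I_2

gap = sample_autocov(f, 1, centred=False) - sample_autocov(f, 1, centred=True)
print(np.round(gap, 2))                  # compare with np.outer(mu, mu)
```

In this design the gap between the uncentred and centred versions is close to $\mu_*\mu_*'$, which is the term separating the two lines of (4.20).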

If $\{f(v_t,\theta_*)\}$ forms a serially uncorrelated sequence then $S_* = \Gamma_0$. Since
$\hat S_{SU} = \hat\Gamma_0$, it follows from (4.20) that

$$ \hat S_{SU} \xrightarrow{p} S_* + \mu_*\mu_*' \tag{4.23} $$

Equation (4.23) indicates that $\hat S_{SU}$ converges to a positive definite matrix of
constants, but obviously not $S_*$. However, given the discussion above, it is
clear that a consistent estimator for $S_*$ is given by:

$$ \hat S_{SU,\mu} = T^{-1}\sum_{t=1}^{T}[f(v_t,\hat\theta_T) - g_T(\hat\theta_T)][f(v_t,\hat\theta_T) - g_T(\hat\theta_T)]' \tag{4.24} $$


Now consider the impact of misspecification on den Haan and Levin’s (1996)
estimator. For this discussion, it is convenient to focus on the case where $f_t$ is
actually generated by

$$ \Psi(L)(f_t - \mu_*) = \Theta(L)e_t \tag{4.25} $$

where the matrix polynomials satisfy the conditions for stationarity and
invertibility in Section 3.5.2 and $e_t$ satisfies the properties listed there as
well. Starting from (4.25), it can be deduced along similar lines to Section 3.5.2
that $f_t$ satisfies the autoregressive model

$$ A(L)(f_t - \mu_*) = e_t \tag{4.26} $$

A comparison of (4.26) with (3.50) indicates that there are now two sources
of misspecification in the autoregressive model used in Step 2 of den Haan and
Levin’s (1996) method. Apart from the truncation error, there is the omission of
the intercept. Unlike the truncation error, the problems caused by the omission
of the intercept cannot be removed by letting the autoregressive lag length tend
to infinity with the sample size. Intuition suggests this type of misspecification
causes $\hat S_{ARMA}$ to be an inconsistent estimator of $S_*$. Unfortunately, a formal
investigation of this question is complicated by the presence of the lag selection
criterion in Step 3 of den Haan and Levin’s (1996) method. However, once again,
consistency is restored by applying a mean correction. This time, the correction
is implemented by applying den Haan and Levin’s method to $f(v_t,\hat\theta_T)$ in mean
deviation form.

Finally, we consider the impact of misspecification on the class of HAC
estimators, both with and without the use of prewhitening and recolouring.
We begin with the uncentred HAC estimator

$$ \hat S_{HAC} = \hat\Gamma_0 + \sum_{i=1}^{T-1}\omega_{i,T}(\hat\Gamma_i + \hat\Gamma_i') \tag{4.27} $$

In Section 3.5.3 it is observed that the kernel, $\omega_{i,T}$, and bandwidth, $b_T$, must be
carefully chosen to ensure the estimator is consistent. However, this comment
is conditional on the assumption that $\hat\Gamma_j$ is a consistent estimator of $\Gamma_j$. As we
have seen, this premise is only valid if the model is correctly specified. This
means inevitably that $\hat S_{HAC}$ is itself no longer a consistent estimator. While
the source of the inconsistency is the same as with $\hat S_{SU}$, the consequences are
more drastic because of the increasing bandwidth. Using results in Gallant and
White (1988) [Chapter 6], it can be shown that

$$ \hat S_{HAC} = S_* + B_T\mu_*\mu_*' + o_p(1) \tag{4.28} $$

where $B_T = 1 + 2\sum_{i=1}^{T-1}\omega_{i,T}$. It can be shown that $B_T$ increases at rate $b_T$
for either the Bartlett, Parzen or Quadratic Spectral kernels. So in these cases,
$\hat S_{HAC}$ is asymptotically equivalent to the sum of two matrices: $S_*$, a positive
definite matrix of constants, and $B_T\mu_*\mu_*'$, a rank one matrix of $O(b_T)$. While
$S_* + B_T\mu_*\mu_*'$ is positive definite for finite $T$, it is clear that the rank one
matrix dominates in the limit as $T \to \infty$. In the next section, it is shown that
(4.28) has an important implication for the limiting behaviour of $\hat S_{HAC}^{-1}$ which
in turn affects the limiting behaviour of the two step GMM estimator. For the
present, we focus instead on how to modify the estimator to ensure consistency
even if the model is misspecified. Once again, the answer is straightforward:
replace $\hat\Gamma_j$ in (4.27) by $\tilde\Gamma_j$ from (4.22). This yields the centred HAC estimator,¹³

$$ \hat S_{HAC,\mu} = \tilde\Gamma_0 + \sum_{i=1}^{T-1}\omega_{i,T}(\tilde\Gamma_i + \tilde\Gamma_i') \tag{4.29} $$
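The contrast between (4.27) and (4.29) can be sketched with a Bartlett kernel. This is an illustration under assumptions of our own: the moment series is simulated as white noise around a hypothetical nonzero mean (so $S_* = I_2$), the weights are taken to be $\omega_{i,T} = 1 - i/b_T$ for $i < b_T$, and the two bandwidths are arbitrary.

```python
import numpy as np

def bartlett_hac(f, bandwidth, centred):
    """HAC estimator (4.27) or (4.29) with Bartlett weights w_i = 1 - i/bandwidth."""
    T = f.shape[0]
    if centred:
        f = f - f.mean(axis=0)                   # mean correction used in (4.22)
    S = f.T @ f / T                              # zero-order autocovariance
    for i in range(1, int(np.ceil(bandwidth))):
        w = 1.0 - i / bandwidth
        G = f[i:].T @ f[:T - i] / T              # i-th sample autocovariance
        S += w * (G + G.T)
    return S

rng = np.random.default_rng(2)
mu = np.array([0.4, 0.2])                        # hypothetical nonzero mu_*
f = mu + rng.normal(size=(5000, 2))              # white noise, so S_* = I_2

results = {b: (bartlett_hac(f, b, centred=False), bartlett_hac(f, b, centred=True))
           for b in (5, 20)}
for b, (unc, cen) in results.items():
    print(b, np.round(np.trace(unc), 2), np.round(np.trace(cen), 2))
```

With these weights $B_T$ equals the (integer) bandwidth, so the trace of the uncentred estimate grows roughly like $2 + b_T\,\mu_*'\mu_*$ while the centred estimate stays near 2, in line with (4.28).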


13 Hall (2000) proves this estimator is consistent with either the Bartlett, Parzen or
Quadratic Spectral kernel and $b_T \to \infty$ with $T$ but $b_T = o(T^{1/2})$.


4.4 The Two Step or Iterated GMM Estimator


In this section, we consider the implications of misspecification for the
probability limit of the two step or iterated estimator. The exact nature of this
transmission mechanism depends on which covariance matrix estimator is used.
For reasons that emerge below, we split the analysis into two parts. Section 4.4.1
considers the case in which $f(v_t,\theta_*) - \mu_*$ is a serially uncorrelated process, and
either $\hat S_{SU}$ or $\hat S_{SU,\mu}$ is used to construct the weighting matrix. Section 4.4.2
considers the case in which either an uncentred or centred HAC estimator is used
to construct the weighting matrix.¹⁴ It emerges that the behaviour in these
two cases is very different, and also very different from the behaviour of the first
step estimator. In this discussion, it is necessary to distinguish various functions
of the parameter vector evaluated at different steps of the estimation. Therefore,
we define $\mu_*(i) = \mu(\theta_*(i))$, $S_*(i) = \lim_{T\to\infty} Var[T^{-1/2}\sum_{t=1}^{T} f(v_t,\theta_*(i))]$,
$\Gamma_0(i) = Var[f(v_t,\theta_*(i))]$, and let $\hat\Gamma_0(i)$, $\tilde\Gamma_0(i)$ denote respectively the
uncentred and centred zero order sample autocovariance matrices evaluated at
$\hat\theta_T(i)$.


4.4.1 Estimation with $W_T = \hat S_{SU}^{-1}$ or $W_T = \hat S_{SU,\mu}^{-1}$


It is most convenient to develop the analysis under the assumption that
$f(v_t,\theta_*)$ is a serially uncorrelated sequence and so $S_*(1) = \Gamma_0(1)$. In this case,
the inconsistency of $\hat S_{SU}$ stems solely from $\mu_* \neq 0$, and not from an incorrect
assumption about the dynamic structure of $f(v_t,\theta_*) - \mu_*$. However, some of
the results hold more generally and so we relax this assumption briefly at the
end to consider the impact of dynamic misspecification.

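We begin our discussion with the second step estimator, $\hat\theta_T(2)$. Recall from
Section 3.6 that $\hat\theta_T(2)$ is calculated using $W_T = \hat S_T(1)^{-1}$ where $\hat S_T(1)$ is an
estimator of the long run variance based on $\hat\theta_T(1)$. Therefore, the population
analog to the second step minimand is given by:

$$ Q_0^{(2)}(\theta) = E[f(v_t,\theta)]'\,W^{(2)}\,E[f(v_t,\theta)] \tag{4.30} $$

where $W^{(2)} = \{\mathrm{plim}_{T\to\infty}\hat S_T(1)\}^{-1}$. Then from Theorem 4.1, (4.21) and (4.22)
it follows that

$$ \hat S_{SU} \xrightarrow{p} \Gamma_0(1) + \mu_*(1)\mu_*(1)' \tag{4.31} $$

$$ \hat S_{SU,\mu} \xrightarrow{p} \Gamma_0(1) \tag{4.32} $$

and so¹⁵

$$ \hat S_{SU}^{-1} \xrightarrow{p} S_*(1)^{-1} - c_*(1)\,S_*(1)^{-1}\mu_*(1)\mu_*(1)'S_*(1)^{-1} = S_\mu(1)^{-1}, \text{ say} \tag{4.33} $$

$$ \hat S_{SU,\mu}^{-1} \xrightarrow{p} S_*(1)^{-1} \tag{4.34} $$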


14 We do not explicitly consider the case in which den Haan and Levin’s (1996) estimator
is used because, as mentioned above, the presence of the lag selection method complicates the
analysis.

15 For example see Morrison (1976) [p.69].

where $c_*(1) = [1 + \mu_*(1)'S_*(1)^{-1}\mu_*(1)]^{-1}$. Inspection of (4.33)-(4.34) reveals
that both $\hat S_{SU}^{-1}$ and $\hat S_{SU,\mu}^{-1}$ converge in probability to positive definite matrices
of constants and so satisfy the conditions for a weighting matrix specified in
Assumption 3.7. It is also apparent that $W^{(2)}$ is different in each case, and
intuition suggests this difference should also manifest itself in the probability
limits of the associated two step estimators. It is hard to confirm or disprove this
intuition by looking at $Q_0^{(2)}(\theta)$. However, more progress can be made by turning to
the population analog of the first order conditions. To this end, let $\theta_*^u$ denote
the unique minimizer of $Q_0^{(2)}(\theta)$ when $W^{(2)} = S_\mu(1)^{-1}$, and $\theta_*^c$ be the unique
minimizer of $Q_0^{(2)}(\theta)$ when $W^{(2)} = S_*(1)^{-1}$.¹⁶ In order to characterize these
values by the first order conditions, it is necessary to assume that Assumption
4.3(ii)-(iii) hold at both $\theta_*^u$ and $\theta_*^c$. Once these conditions are imposed, it
follows that $\theta_*^u$ is the solution to the first order conditions

$$ G(\theta)'S_*(1)^{-1}E[f(v_t,\theta)] - c_*(1)\,G(\theta)'S_*(1)^{-1}\mu_*(1)\mu_*(1)'S_*(1)^{-1}E[f(v_t,\theta)] = 0 \tag{4.35} $$

and $\theta_*^c$ is the solution to

$$ G(\theta)'S_*(1)^{-1}E[f(v_t,\theta)] = 0 \tag{4.36} $$

Inspection of (4.35) and (4.36) reveals two features of the probability limits: in
general, $\theta_*^u \neq \theta_*^c$ and neither equals $\theta_*(1)$, the probability limit of $\hat\theta_T(1)$.¹⁷
However, there is one exception which should be noted. If $\theta_*(1)$ satisfies (4.36),
then it also satisfies (4.35), and so $\theta_*^u = \theta_*^c = \theta_*(1)$. Such a coincidence
would occur if the first step weighting matrix is of the form $kS_*(1)^{-1}$ for some
constant $k$, but this is unlikely to be the case in general. The equality between
the two probability limits can also occur if the estimation is iterated beyond
two steps. If both the iterated estimator based on $W_T = \hat S_{SU}^{-1}$ and the iterated
estimator based on $W_T = \hat S_{SU,\mu}^{-1}$ individually converge then it can be shown
using appropriately modified versions of (4.35) and (4.36) that both estimators
have the same probability limit.
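The inverse in (4.33) is an instance of the rank one update formula cited from Morrison (1976). As a minimal sketch, the code below checks it for an arbitrary positive definite matrix standing in for $S_*(1)$ and an arbitrary vector standing in for $\mu_*(1)$; both are hypothetical values generated only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
q = 4
A = rng.normal(size=(q, q))
S = A @ A.T + q * np.eye(q)                  # hypothetical positive definite S_*(1)
mu = rng.normal(size=q)                      # hypothetical mu_*(1)

S_inv = np.linalg.inv(S)
c = 1.0 / (1.0 + mu @ S_inv @ mu)            # c_*(1) as defined after (4.34)

lhs = np.linalg.inv(S + np.outer(mu, mu))    # inverse of the limit in (4.31)
rhs = S_inv - c * S_inv @ np.outer(mu, mu) @ S_inv   # right-hand side of (4.33)
print(np.allclose(lhs, rhs))
print(np.linalg.eigvalsh(rhs).min() > 0)     # the limit is positive definite
```

The second check reflects the remark that the limit in (4.33) remains a valid weighting matrix even though it differs from $S_*(1)^{-1}$.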

16 The superscript on $\theta_*$ reflects whether the covariance matrix is uncentred or centred,
and the ‘(2)’ argument is suppressed for ease of notation.

17 Recall that if the model is correctly specified then the probability limit of all three
estimators is $\theta_0$.

Now consider the limiting distribution of the second step estimator.
Regardless of whether $W_T = \hat S_{SU}^{-1}$ or $W_T = \hat S_{SU,\mu}^{-1}$, it is possible to establish
that the second step estimator has a limiting normal distribution under plausible
conditions. For brevity, we focus on the case in which $W_T = \hat S_{SU,\mu}^{-1}$ but a similar
argument applies for the case in which $W_T = \hat S_{SU}^{-1}$. The argument is based on
an appeal to Theorem 4.2(ii). Using the same trick as (4.19), it can be shown
that the limiting distribution of $\hat\theta_T(2)$ is given by Theorem 4.2(ii) provided
$\mathrm{vech}\{T^{1/2}(\tilde\Gamma_0(1) - \Gamma_0(1))\}$ converges to a normal distribution. However, this
appeal to Theorem 4.2(ii) is not as benign as it first appears. Using the Mean
Value Theorem, it can be shown that

$$
\begin{aligned}
\mathrm{vech}\{T^{1/2}[\tilde\Gamma_0(1) - \Gamma_0(1)]\} = {} & \mathrm{vech}\{T^{1/2}[\Gamma_{0,T}(\theta_*(1)) - \Gamma_0(1)]\} \\
 & + (\partial/\partial\theta')\mathrm{vech}\{\Gamma_{0,T}(\theta_*(1))\}\,T^{1/2}[\hat\theta_T(1) - \theta_*(1)] \\
 & + o_p(1)
\end{aligned}
\tag{4.37}
$$

where $\Gamma_{0,T}(\theta) = T^{-1}\sum_{t=1}^{T}[f(v_t,\theta) - \mu(\theta)][f(v_t,\theta) - \mu(\theta)]'$. Therefore, the large
sample behaviour of $T^{1/2}[\hat\theta_T(2) - \theta_*(2)]$ depends on the large sample behaviour of
$T^{1/2}[\hat\theta_T(1) - \theta_*(1)]$ unless $(\partial/\partial\theta')\mathrm{vech}\{\Gamma_{0,T}(\theta_*(1))\} \xrightarrow{p} 0$. In general, there is no
reason to suppose that this condition holds. A similar argument can be applied
to the iterated versions of these estimators to deduce that the limiting
distribution of $T^{1/2}[\hat\theta_T(i) - \theta_*(i)]$ depends on $\{T^{1/2}[\hat\theta_T(j) - \theta_*(j)],\ j = 1,2,\ldots,i-1\}$
in general. Needless to say, this recursive structure must be taken into account
in the calculation of the asymptotic variance of the estimator. However, we do
not pursue the form of this asymptotic variance further here.

So far, it has been assumed that $f(v_t,\theta_*)$ is a serially uncorrelated sequence.
If this assumption is relaxed then $S_*(1)$ must be replaced by $\Gamma_0(1)$ in (4.35) and
(4.36). However, this substitution has no qualitative impact on the foregoing
analysis of the probability limits of the estimators, and so all the conclusions
remain valid in this more general case. The assumption of no serial correlation
also has no qualitative impact on the appeal to Theorem 4.2(ii) to deduce the
asymptotic normality. However, its relaxation introduces a dynamic structure
in $f(v_t,\theta_*) - \mu_*$ which must be accounted for in the definitions, and also the
estimation, of the covariance matrices $V_{i,j}$ in Theorem 4.2(ii).

To conclude this sub-section, we examine the impact of using $\hat S_{SU,\mu}^{-1}$ in our
empirical example.


Example: Hansen and Singleton’s (1982) Consumption Based Asset
Pricing Model

Table 3.7 in Section 3.6 reports the results from the two step and iterated
estimations with $\hat S_{SU}^{-1}$ used as weighting matrix. Table 4.1 contains the analogous
results when $\hat S_{SU,\mu}^{-1}$ is used. With equally weighted returns (EWR), convergence
takes 5 and 4 iterations respectively with $W_T^{(1)}$ = 10° $I_5$ and
$W_T^{(1)} = (T^{-1}\sum_{t=1}^{T} z_tz_t')^{-1}$. With value weighted returns (VWR), one fewer
iteration is needed in each case. If the model is correctly specified, then the
probability limit of the estimator is the same on all steps. The results in Table 3.7
indicate the iterated estimator converges to the same values for a given asset
irrespective of the first step weighting matrix. Our analysis in this sub-section
indicates that if convergence occurs then the probability limits of the iterated
estimators should be the same regardless of whether the weighting matrix is
$\hat S_{SU}^{-1}$ or $\hat S_{SU,\mu}^{-1}$, even if the model is misspecified. These arguments lead us to
expect that the corresponding estimates should be close in large finite samples
irrespective of whether the model ultimately proves to be correctly or incorrectly
specified. A comparison of Tables 3.7 and 4.1 indicates the iterated estimates are
identical to three decimal places for VWR and to two decimal places for EWR. ⋄


Table 4.1
Two step and iterated GMM estimators for the consumption based
asset pricing model with EWR and VWR


EWR:
$W_T^{(1)}$                              $(\hat\gamma,\hat\delta)$ for i=1    $(\hat\gamma,\hat\delta)$ for i=2    $(\hat\gamma,\hat\delta)$ after iteration
10° $I_5$                                (-3.145, 0.999)      (-0.253, 0.992)      (-0.344, 0.992)
$(T^{-1}\sum_{t=1}^{T} z_tz_t')^{-1}$    (0.398, 0.993)       (-0.335, 0.992)      (-0.344, 0.992)

VWR:
$W_T^{(1)}$                              $(\hat\gamma,\hat\delta)$ for i=1    $(\hat\gamma,\hat\delta)$ for i=2    $(\hat\gamma,\hat\delta)$ after iteration
10° $I_5$                                (-1.871, 0.998)      (0.716, 0.993)       (0.666, 0.994)
$(T^{-1}\sum_{t=1}^{T} z_tz_t')^{-1}$    (0.698, 0.994)       (0.666, 0.994)       (0.666, 0.994)

Note: $W_T^{(1)}$ denotes the first-step weighting matrix.


4.4.2 Estimation with $W_T = \hat S_{HAC}^{-1}$ or $W_T = \hat S_{HAC,\mu}^{-1}$


Now let us consider the same questions in the cases where either an uncentred 
or centred HAC estimator is used to construct the weighting matrix. Although 
there are some similarities between the two cases, there are sufficient differences 
to necessitate a separate treatment for each. It emerges that the distribution 
theory is very different from the cases considered above and non-standard in the 
sense that the estimator no longer converges at rate $T^{-1/2}$. In this sub-section,
we concentrate on explaining the sources of these differences and so only provide 
heuristic arguments to justify the stated results. A more rigorous treatment can 
be found in Hall (2000) and Hall and Inoue (2003). 


4.4.2.1 Estimation with $W_T = \hat S_{HAC,\mu}^{-1}$


First notice that $\hat S_{HAC,\mu}^{-1}$ satisfies the conditions for a valid weighting matrix
given in Assumption 3.7 because, by construction, $\hat S_{HAC,\mu}$ is positive
semi-definite for finite $T$ and converges in probability to the positive definite
matrix $S_*$. This suggests that we can appeal to similar arguments as in the proof
of Theorem 4.1 in order to deduce that $\hat\theta_T(2)$ converges in probability to some
value.


Corollary 4.1 Probability Limit of $\hat\theta_T(2)$
Let $W_T = \hat S_{HAC,\mu}^{-1}$ and $\hat S_{HAC,\mu} \xrightarrow{p} S_*(1)$, a positive definite matrix. If
Assumptions 3.1-3.2, 3.8-3.10, 4.1 hold and Assumption 4.2 holds at $\theta_* = \theta_*(2)$ then
$\hat\theta_T(2) \xrightarrow{p} \theta_*(2)$.


Notice that in general $\theta_*(2) \neq \theta_*(1)$, the probability limit of the first step
estimator, unless the weighting matrix on the first step, $W_T(1)$, is proportional
to $S_*(1)^{-1}$. In practice, there is no reason to suppose that $W_T(1) = kS_*(1)^{-1}$
except by coincidence and so the probability limits of the first and second step
estimators are different in most circumstances.

Now let us consider the limiting distribution of $\hat\theta_T(2)$. In the previous
section, it is shown that Theorem 4.2 can be invoked to deduce the asymptotic
normality of the second step estimator when $W_T$ equals $\hat S_{SU}^{-1}$ or $\hat S_{SU,\mu}^{-1}$. However,
such a strategy does not work here. The key difference is in the rate of
convergence of $W_T$ to $W$. While $\hat S_{SU,\mu}^{-1}$ converges to $\Gamma_0(1)^{-1}$ at rate $T^{1/2}$,
$\hat S_{HAC,\mu}^{-1}$ converges to $S_*(1)^{-1}$ at a slower rate. This means that
$T^{1/2}[\hat S_{HAC,\mu}^{-1} - S_*(1)^{-1}]$ diverges as $T \to \infty$, and hence $T^{1/2}[\hat\theta_T(2) - \theta_*(2)]$ does
the same. Therefore, in order to derive the limiting distribution, we must scale
$\hat\theta_T(2) - \theta_*(2)$ by some other function of $T$ which increases at a slower rate than
$T^{1/2}$.

Since a similar story is going to emerge when $W_T = \hat S_{HAC}^{-1}$ (albeit with
a different rate of convergence) it is more convenient to develop the analysis
at a general level and then specialize the derived result to deduce the limiting
distribution when $W_T = \hat S_{HAC,\mu}^{-1}$. Accordingly, we consider the case in which
$W_T$ converges to $W$ at rate $c_T$ where $c_T$ is a sequence of constants with the
properties $c_T \to \infty$ with $T \to \infty$ and $c_T = o(T^{1/2})$. We also return to the generic
notation of $\hat\theta_T$ for the estimator and $\theta_*$ for its plim to facilitate comparison with
Section 4.2. Our starting point is (4.8) with $c_T$ substituted for $T^{1/2}$, that is

$$
\begin{aligned}
c_T(\hat\theta_T - \theta_*) &= (c_T/T^{1/2})H_{0,T}\{H_{1,T} + H_{2,T}\} \\
 &= (c_T/T^{1/2})H_{0,T}H_{1,T} + (c_T/T^{1/2})H_{0,T}H_{2,T}
\end{aligned}
\tag{4.38}
$$

where $H_{0,T}$, $H_{1,T}$ and $H_{2,T}$ are defined in (4.9)-(4.11). We now consider the
behaviour of the two terms on the right-hand side of (4.38) in turn. In Section
4.2 it is shown that $H_{0,T}H_{1,T} = O_p(1)$, and an inspection of the argument
reveals that this conclusion did not depend on the rate at which $W_T$ converges
to $W$. Therefore, the same arguments can be used here. However, since this
term is multiplied by $c_T/T^{1/2}$ and $c_T = o(T^{1/2})$, it follows that

$$ (c_T/T^{1/2})H_{0,T}H_{1,T} \xrightarrow{p} 0 \tag{4.39} $$

Now consider $(c_T/T^{1/2})H_{0,T}H_{2,T}$. Using (4.12)-(4.16) and $H_{2,T}(4) = 0$, it
follows that

$$ (c_T/T^{1/2})H_{0,T}H_{2,T} = (c_T/T^{1/2})H_{0,T}\{H_{2,T}(1) + H_{2,T}(2) + H_{2,T}(3)\} \tag{4.40} $$

Using similar arguments to Section 4.2 to analyse $H_{0,T}H_{2,T}(i)$, $i = 1,2$, it can
be shown that

$$ (c_T/T^{1/2})H_{0,T}[H_{2,T}(1) + H_{2,T}(2)] = H_{0,T}M_Tc_T(\hat\theta_T - \theta_*) + o_p(1) \tag{4.41} $$

Therefore, combining (4.38)-(4.41), it follows that

$$ c_T(\hat\theta_T - \theta_*) = [I_p - H_{0,T}M_T]^{-1}H_{0,T}(c_T/T^{1/2})H_{2,T}(3) + o_p(1) \tag{4.42} $$

Just as in Section 4.2, $H_{0,T}$ and $M_T$ converge in probability to matrices of
constants (under certain conditions) and so (4.42) implies the limiting behaviour
of $c_T(\hat\theta_T - \theta_*)$ is driven by

$$ (c_T/T^{1/2})H_{2,T}(3) = G_*'\,c_T(W_T - W)\mu_* \tag{4.43} $$


To proceed further, we must make some assumption about $c_T(W_T - W)$ and
so we return to the specific example of interest here. Using a similar argument
to (4.19),

$$ c_T[\hat S_{HAC,\mu}^{-1} - S_*(1)^{-1}] = -\hat S_{HAC,\mu}^{-1}\{c_T[\hat S_{HAC,\mu} - S_*(1)]\}S_*(1)^{-1} \tag{4.44} $$

and so it suffices to consider $c_T[\hat S_{HAC,\mu} - S_*(1)]$ because $S_*(1)^{-1} = O(1)$ and
$\hat S_{HAC,\mu}^{-1} = O_p(1)$. To this end, it is useful to introduce the following notation.
We define

$$ \bar S_{*,T} = \bar\Gamma_{0,T} + \sum_{i=1}^{T-1}\omega_{i,T}(\bar\Gamma_{i,T} + \bar\Gamma_{i,T}') $$

$$ S_{*,T} = \Gamma_0 + \sum_{i=1}^{T-1}\omega_{i,T}(\Gamma_i + \Gamma_i') $$

where $\bar\Gamma_{i,T} = T^{-1}\sum_{t=i+1}^{T}[f(v_t,\theta_*(1)) - g_T(\theta_*(1))][f(v_{t-i},\theta_*(1)) - g_T(\theta_*(1))]'$.
Using these definitions, $c_T(\hat S_{HAC,\mu} - S_*(1))$ can be decomposed into the sum of
three terms as follows,

$$ c_T(\hat S_{HAC,\mu} - S_*(1)) = c_T(\hat S_{HAC,\mu} - \bar S_{*,T}) + c_T(\bar S_{*,T} - S_{*,T}) + c_T(S_{*,T} - S_*(1)) \tag{4.45} $$

Notice that the first component, $\hat S_{HAC,\mu} - \bar S_{*,T}$, represents the difference
between the HAC evaluated at $\hat\theta_T(1)$ and $\theta_*(1)$; the second component,
$\bar S_{*,T} - S_{*,T}$, is the difference between the HAC evaluated at $\theta_*(1)$ and the
corresponding function evaluated at population instead of sample
autocovariances; and the third component, $S_{*,T} - S_*(1)$, is the difference between
the population analog to the HAC and the long run covariance matrix. Notice
that the sum of the first two components is $\hat S_{HAC,\mu} - S_{*,T}$, and so can be
interpreted as the error inherent in using the HAC estimator to estimate its
population analog. The third component can then be interpreted as the bias
induced by estimating $S_{*,T}$ instead of $S_*(1)$.
Hall and Inoue (2003) verify that under a set of plausible regularity
conditions the three components in (4.45) behave as follows.


Assumption 4.7 Limiting Behaviour of the Components of $\hat S_{HAC,\mu} - S_*(1)$


1. $(T/b_T)^{1/2}\mathrm{vech}(\hat S_{HAC,\mu} - \bar S_{*,T}) = o_p(1)$.


2. $(T/b_T)^{1/2}\mathrm{vech}(\bar S_{*,T} - S_{*,T}) \xrightarrow{d} N(0, \Omega)$ where $\Omega$ is a positive definite
matrix depending on the kernel $\omega(.)$.


3. $\lim_{T\to\infty} b_T^k(S_{*,T} - S_*(1)) = C$ where $k > 0$ is known as the characteristic
exponent of the kernel $\omega(.)$,¹⁸ that is, $\lim_{x\to 0}[1 - \omega(x)]/|x|^k$ is a finite
non-zero constant.

Before proceeding, it is worth briefly commenting on certain aspects of this
assumption. In all our previous invocations of asymptotic normality, such as
Lemma 4.1, the rate of convergence has been $T^{-1/2}$. The key difference here is
in the form of $\hat S_{HAC,\mu} - S_*(1)$. Recall that $\hat S_{HAC,\mu}$ is itself a weighted sum of
$T - 1$ autocovariances (and their transposes). While we can apply the Central
Limit Theorem to deduce $T^{1/2}\mathrm{vech}\{\tilde\Gamma_i - \Gamma_i\}$ converges to a normal distribution for
fixed $i$, the rate of increase in the number of autocovariances included in
$\hat S_{HAC,\mu}$ slows down the rate of convergence.¹⁹ Notice also that the rate of
convergence of all three components depends on the bandwidth, and the behaviour
of the second and third components also depends on the kernel.

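Using equation (4.45) and Assumption 4.7, it follows that the limiting
behaviour of $c_T(\hat S_{HAC,\mu} - S_*(1))$ depends on the bandwidth and the kernel:


• if $\lim_{T\to\infty} T^{1/2}/b_T^{1/2+k} = 0$ then $(T/b_T)^{1/2}\mathrm{vech}(\hat S_{HAC,\mu} - S_*(1)) \xrightarrow{d} N(0, \Omega)$;

• if $\lim_{T\to\infty} T^{1/2}/b_T^{1/2+k} = \varsigma \in (0,\infty)$ then $(T/b_T)^{1/2}\mathrm{vech}(\hat S_{HAC,\mu} - S_*(1)) \xrightarrow{d} N(\varsigma\,\mathrm{vech}(C), \Omega)$;

• if $\lim_{T\to\infty} T^{1/2}/b_T^{1/2+k} = \infty$ then $\mathrm{plim}_{T\to\infty}\, b_T^k(\hat S_{HAC,\mu} - S_*(1)) = C$.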
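To see how the three cases map into bandwidth choices, suppose (purely as an assumption for illustration) that $b_T$ is proportional to $T^a$ with $0 < a < 1/2$. Then $T^{1/2}/b_T^{1/2+k}$ behaves like $T^{1/2 - a(1/2+k)}$ and the relevant case is determined by the sign of that exponent. The sketch below uses the standard characteristic exponents $k = 1$ for the Bartlett kernel and $k = 2$ for the Parzen and Quadratic Spectral kernels; the function name and the value $a = 0.3$ are ours.

```python
def bandwidth_regime(a, k, tol=1e-12):
    """Classify lim T^{1/2} / b_T^{1/2+k} when b_T is proportional to T**a.

    The ratio behaves like T**(1/2 - a*(1/2 + k)), so only the sign of
    that exponent matters.
    """
    exponent = 0.5 - a * (0.5 + k)
    if exponent < -tol:
        return "limit 0: (T/b_T)^{1/2} rate, normal limit centred at zero"
    if exponent > tol:
        return "limit infinity: b_T^k rate, bias term dominates"
    return "finite nonzero limit: (T/b_T)^{1/2} rate, normal limit with nonzero mean"

# Characteristic exponents: k = 1 for Bartlett, k = 2 for Parzen and Quadratic Spectral.
for name, k in [("Bartlett", 1), ("Parzen", 2), ("Quadratic Spectral", 2)]:
    boundary = 1.0 / (1.0 + 2.0 * k)     # value of a at which the exponent is zero
    print(name, round(boundary, 3), bandwidth_regime(0.3, k))
```

Under this parameterization the boundary case corresponds to $a = 1/(1+2k)$, so the same bandwidth exponent can fall in different cases for different kernels.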


Notice that neither the rate of convergence nor the nature of the limiting
behaviour is the same in all three cases. In particular, if
$\lim_{T\to\infty} T^{1/2}/b_T^{1/2+k} = \infty$ then the bias term, $S_{*,T} - S_*(1)$, becomes dominant and
this causes $b_T^k(\hat S_{HAC,\mu} - S_*(1))$ to converge to a constant. As would be
anticipated, these differences also manifest themselves in the limiting behaviour
of the estimator. Using (4.42)-(4.45) and Assumption 4.7, the following three
possibilities emerge for the limiting behaviour of $\hat\theta_T(2)$.


18 Anderson (1994) [Section 9.3.2] defines the characteristic exponent and discusses its
properties.
19 For further discussion see Andrews (1991) or Hall and Inoue (2003).


Lemma 4.2 Limiting Behaviour of $\hat\theta_T(2)$ When $W_T = \hat S_{HAC,\mu}^{-1}$

Assume that: (i) $W_T = \hat S_{HAC,\mu}^{-1}$ and $\hat S_{HAC,\mu} \xrightarrow{p} S_*(1)$, a positive definite matrix;
(ii) Assumptions 3.1, 3.8 and 4.7, and certain other regularity conditions hold.²⁰
The limiting distribution is as follows:


• if $\lim_{T\to\infty} T^{1/2}/b_T^{1/2+k} = 0$ then $(T/b_T)^{1/2}[\hat\theta_T(2) - \theta_*(2)] \xrightarrow{d} N(0, \Sigma_3)$;

• if $\lim_{T\to\infty} T^{1/2}/b_T^{1/2+k} = \varsigma \in (0,\infty)$ then $(T/b_T)^{1/2}[\hat\theta_T(2) - \theta_*(2)] \xrightarrow{d} N(\varsigma H_{**}^{-1}G_{**}'S_*(1)^{-1}CS_*(1)^{-1}\mu_*(2), \Sigma_3)$;

• if $\lim_{T\to\infty} T^{1/2}/b_T^{1/2+k} = \infty$ then $b_T^k[\hat\theta_T(2) - \theta_*(2)] \xrightarrow{p} H_{**}^{-1}G_{**}'S_*(1)^{-1}CS_*(1)^{-1}\mu_*(2)$;


where $\Sigma_3 = H_{**}^{-1}DB\Omega B'D'H_{**}^{-1}$, $D = -(\mu_*(2)'S_*(1)^{-1} \otimes G_{**}'S_*(1)^{-1})$, $B$ is the
selection matrix defined by $\mathrm{vec}\{S_*\} = B\,\mathrm{vech}\{S_*\}$, and $H_{**}$ and $G_{**}$ are
respectively $H_*$ and $G_*$ in Assumption 4.6 evaluated at $\theta_* = \theta_*(2)$ instead of
$\theta_*(1)$.


It is interesting to contrast this result with the corresponding discussion in
the case where $W_T = \hat S_{SU}^{-1}$ or $\hat S_{SU,\mu}^{-1}$. Notice that unlike those previous cases,
the asymptotic distribution of the second step estimator does not depend on the
first step estimator. The reason is that one of the regularity conditions behind
Assumption 4.7 is the restriction that $\hat\theta_T(1) - \theta_*(1) = O_p(T^{-1/2})$.²¹ This means
that $(T/b_T)^{1/2}[\hat\theta_T(1) - \theta_*(1)] = o_p(1)$, and so can have no effect on the large
sample behaviour of $(T/b_T)^{1/2}[\hat\theta_T(2) - \theta_*(2)]$.

We now consider the iterated estimator. It is straightforward to extend
Corollary 4.1 to $\hat\theta_T(i)$. However, the limiting distribution of the iterated
estimator is going to be very complicated in general. Using a similar argument to
(4.37), it follows that if $\lim_{T\to\infty} T^{1/2}/b_T^{1/2+k} \in [0,\infty)$ the limiting distribution of
$(T/b_T)^{1/2}[\hat\theta_T(i) - \theta_*(i)]$ depends on $\{(T/b_T)^{1/2}[\hat\theta_T(j) - \theta_*(j)],\ j = 2,3,\ldots,i-1\}$
in general. Notice that, this time, the dependence only goes back to the second
step for the reasons discussed above.


4.4.2.2 Estimation with $W_T = \hat S_{HAC}^{-1}$


We now consider the case in which the second step estimator is calculated using
the uncentred HAC estimator based on $\hat\theta_T(1)$. To begin, we must consider
whether $\hat S_{HAC}^{-1}$ satisfies the conditions for a valid weighting matrix given in
Assumption 3.7. Since this part of the analysis is generic to all steps, we return
to our more general notation of $\hat\theta_T$ for the estimator and $\theta_*$ for its limit. In
Section 4.3.3, it is shown that the large sample behaviour of $\hat S_{HAC}$ is identical
to $S_* + B_T\mu_*\mu_*'$. The following lemma characterizes the implications of this
structure for the large sample behaviour of $\hat S_{HAC}^{-1}$:


20 These include $\hat\theta_T(1) - \theta_*(1) = O_p(T^{-1/2})$. See Hall and Inoue (2003) [Theorem 3] for a
complete list of regularity conditions and also a rigorous proof.
21 This condition is “plausible” because it is implied by Theorem 4.2.


Lemma 4.3 Limiting Behaviour of $\hat S_{HAC}^{-1}$
If $\hat S_{HAC} = S_* + B_T\mu_*\mu_*' + o_p(1)$ where $B_T = 1 + 2\sum_{i=1}^{T-1}\omega_{i,T}$, $B_T = O(b_T)$ and
the bandwidth satisfies $b_T \to \infty$, $b_T = o(T^{1/2})$ then: $\hat S_{HAC}^{-1} \xrightarrow{p} S^+$ where

$$ S^+ = S_*^{-1} - \frac{1}{\mu_*'S_*^{-1}\mu_*}\,S_*^{-1}\mu_*\mu_*'S_*^{-1} \tag{4.46} $$

Since the structure of this inverse is non-standard and also important below, we
present a heuristic proof.²² Since

$$ \hat S_{HAC} = S_* + B_T\mu_*\mu_*' + o_p(1) \tag{4.47} $$

and $S_* + B_T\mu_*\mu_*'$ is nonsingular for any finite $T$, it follows that the large sample
behaviour of $\hat S_{HAC}^{-1}$ can be deduced from $(S_* + B_T\mu_*\mu_*')^{-1}$. For any $T$, we
have²³

$$ (S_* + B_T\mu_*\mu_*')^{-1} = S_*^{-1} - \frac{B_T}{1 + B_T\mu_*'S_*^{-1}\mu_*}\,S_*^{-1}\mu_*\mu_*'S_*^{-1} \tag{4.48} $$

Recall that $b_T \to \infty$ as $T \to \infty$, and so it follows from (4.47)-(4.48) that

$$ \hat S_{HAC}^{-1} \xrightarrow{p} \lim_{T\to\infty}(S_* + B_T\mu_*\mu_*')^{-1} = S_*^{-1} - \frac{1}{\mu_*'S_*^{-1}\mu_*}\,S_*^{-1}\mu_*\mu_*'S_*^{-1} = S^+ $$

The matrix $S^+$ has two properties which play an important role in the
analysis.


Corollary 4.2 Properties of $S^+$
(i) rank($S^+$) = $q - 1$; (ii) the nullspace of $S^+$ is spanned by $\mu_*$.


Notice that part (i) implies that $\hat S_{HAC}^{-1}$ converges to a singular matrix and so
does not satisfy the conditions for a weighting matrix laid down in Assumption
3.7.
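Lemma 4.3 and Corollary 4.2 can be checked numerically. The following is a minimal sketch in which a randomly generated positive definite matrix stands in for $S_*$ and a random vector for $\mu_*$; these values, and the grid of $B_T$ values, are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
q = 3
A = rng.normal(size=(q, q))
S = A @ A.T + q * np.eye(q)        # hypothetical positive definite S_*
mu = rng.normal(size=q)            # hypothetical nonzero mu_*

S_inv = np.linalg.inv(S)
S_plus = S_inv - np.outer(S_inv @ mu, S_inv @ mu) / (mu @ S_inv @ mu)   # (4.46)

# Largest elementwise gap between (S_* + B_T mu mu')^{-1} and S^+ as B_T grows.
gaps = [np.abs(np.linalg.inv(S + B * np.outer(mu, mu)) - S_plus).max()
        for B in (1e1, 1e3, 1e5)]

print(gaps)                                         # shrinks as B_T grows
print(np.linalg.matrix_rank(S_plus, tol=1e-10))     # q - 1
print(np.allclose(S_plus @ mu, 0.0))                # mu_* lies in the nullspace
```

The explicit tolerance in the rank calculation is a numerical convenience: the smallest singular value of $S^+$ is zero in exact arithmetic but only approximately so in floating point.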

With this in mind, now consider the population analog to the second-step
minimand when $W_T = \hat S_{HAC}^{-1}$. From Lemma 4.3, this minimand is given by

$$ Q_0^{(2)}(\theta) = E[f(v_t,\theta)]'\,S^+\,E[f(v_t,\theta)] \tag{4.49} $$

Using Corollary 4.2(ii), it can be seen that $Q_0^{(2)}(\theta)$ attains its minimum possible
value of zero at $\theta = \theta_*$. To explore the implications of this structure for the
estimator, we must impose some form of identification condition. The simplest
such condition is to assume that this minimum is unique or, in other words, that
there is no other value of $\theta$ which generates a value of $\mu(\theta)$ in the nullspace of
$S^+$.


Assumption 4.8 Identification Condition
$S^+E[f(v_t,\theta)] \neq 0$ for any $\theta \in \Theta \setminus \{\theta_*\}$.


22 See Hall (2000) for a rigorous proof.
23 See the matrix inversion result in (4.33).

In some cases it is possible to verify that this assumption holds, but in other
models its imposition is more an article of faith. Once identification is assumed
we can use the same sequence of arguments as in Theorem 4.1 to deduce the
following result.


Corollary 4.3 Probability Limit of $\hat\theta_T(2)$
Let $W_T = \hat S_{HAC}^{-1}$. If (i) Assumptions 3.1, 3.2, 3.8-3.10, 4.1 and 4.8 hold; (ii)
$\hat\theta_T(1) \xrightarrow{p} \theta_*(1)$; (iii) $\hat S_{HAC}^{-1} \xrightarrow{p} S^+$; then: $\hat\theta_T(2) \xrightarrow{p} \theta_*(1)$.


Corollary 4.3 states that the GMM estimator converges to the same
probability limit in both the first- and second-step estimations. It is
straightforward to extend this result to the iterated estimator as well. Therefore
if $W_T = \hat S_{HAC}^{-1}$ then the probability limits of the estimators exhibit the same
type of behaviour as they would in a correctly specified model. This also implies
that the iterated estimation converges after just two steps with probability one.
Therefore, the second step and iterated estimators are asymptotically identical.
One other consequence of Corollary 4.3 is that there is no need to index
population quantities, such as $\mu_*$ or $S_*$, by $i$, and so we drop this index for the
rest of this section.

Now consider the limiting distribution of $\hat\theta_T(2)$. As mentioned in the
previous section, the rate of convergence is slower than $T^{1/2}$, and so we must
return to (4.42) in order to start the analysis. However, this time there are some
additional simplifications of which we can take advantage. Corollary 4.2(ii)
implies $S^+\mu_* = 0$ and so both $\mathrm{plim}_{T\to\infty}M_T = 0$ and $(W_T - W)\mu_* = W_T\mu_*$.
Therefore, (4.42) reduces to

$$ c_T(\hat\theta_T - \theta_*) = H_{0,T}G_*'\,c_T\hat S_{HAC}^{-1}\mu_* + o_p(1) \tag{4.50} $$

The key question is what is the appropriate choice of $c_T$. To answer this
question, it is convenient to rewrite (4.50) as

$$ c_T(\hat\theta_T - \theta_*) = H_{0,T}G_*'\,c_T\bar S_T^{-1}\mu_* + H_{0,T}G_*'\,c_T(\hat S_{HAC}^{-1} - \bar S_T^{-1})\mu_* + o_p(1) \tag{4.51} $$

where $\bar S_T = S_* + B_T\mu_*\mu_*'$. Hall and Inoue (2003) establish that the following
results hold under plausible regularity conditions.

$$ H_{0,T}G_*' = -(G_*'S^+G_*)^{-1}G_*' = O(1) \tag{4.52} $$

$$ b_T\bar S_T^{-1}\mu_* = \frac{b_T}{1 + B_T\mu_*'S_*^{-1}\mu_*}\,S_*^{-1}\mu_* = O(1) \tag{4.53} $$

$$ (\hat S_{HAC}^{-1} - \bar S_T^{-1})\mu_* = -\hat S_{HAC}^{-1}(\hat S_{HAC} - \bar S_T)\bar S_T^{-1}\mu_* \tag{4.54} $$

$$ = O_p(1)\,O_p(b_T/T^{1/2})\,O_p(b_T^{-1}) = O_p(T^{-1/2}) \tag{4.55} $$

If these results are used in conjunction with (4.51) then a two-part answer
emerges to our question.


Lemma 4.4 Rate of Convergence for Êr(2) 
Let Wr = SE acs If (a) Assumptions 8.1, 3.8,and 4.8 hold; (b) ae 2, S+, 
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(c) Equations (4.52)-(4.55) hold; (d) certain other regularity conditions hold.?+ 
Then (i) br[6r(2) — 9x] > B(G STG.) GL Sz un if @ S7! us #0 where B = 
—limpoo(br/Br)/(u Sz tx); (ti) TY? [6r(2) — 04] = Op(1) if G87" = 0. 


Notice that G‘ S7!u, = 0 implies that plimp_.o 97(1) = 6. is the solution 
to the population analog to the first order conditions when Wr = Brack 
Therefore part (ii) is only relevant in the unlikely eventuality that the probability 
limit of the first step weighting matrix is proportional to the long run variance 
S,.2° So the most relevant part of the lemma in practice is likely to be part (i). 
Lemma 4.4(i) states that br[Êr(2) — 6,] converges to a degenerate distribution, 
or in other words a constant vector. This behaviour is similar to the case when 
Wr = oe) and limp sooT V2 Jb?" = oo, and has a correspondingly similar 
explanation. However, this time it is the bias induced by the use of uncentred 
autocovariance matrices in the HAC that is dominant. 
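
The algebra behind (4.53) is an application of the Sherman-Morrison formula to $\bar S_T = S_\mu + B_T\mu_*\mu_*'$. The following Python sketch checks it numerically and shows why the scaled vector settles down to a constant. The matrix `S_mu`, the vector `mu` and the values of `B_T` are arbitrary illustrative choices, not quantities from the text, and the sketch sets $b_T = B_T$ purely for convenience.

```python
import numpy as np

rng = np.random.default_rng(0)
q = 4
A = rng.standard_normal((q, q))
S_mu = A @ A.T + q * np.eye(q)       # hypothetical positive definite S_mu
mu = rng.standard_normal(q)          # hypothetical non-zero mu_*

def sbar_inv_mu(B_T):
    """Solve S_bar_T x = mu directly, with S_bar_T = S_mu + B_T * mu mu'."""
    S_bar = S_mu + B_T * np.outer(mu, mu)
    return np.linalg.solve(S_bar, mu)

S_inv_mu = np.linalg.solve(S_mu, mu)
k = mu @ S_inv_mu                    # the scalar mu' S_mu^{-1} mu

# Closed form used in (4.53): S_bar^{-1} mu = S_mu^{-1} mu / (1 + B_T k)
for B_T in (1.0, 10.0, 1000.0):
    assert np.allclose(sbar_inv_mu(B_T), S_inv_mu / (1.0 + B_T * k))

# For large B_T (with b_T = B_T), b_T * S_bar^{-1} mu approaches S_mu^{-1} mu / k
B_T = 1e6
approx = B_T * sbar_inv_mu(B_T)
limit = S_inv_mu / k
print(np.max(np.abs(approx - limit)))
```

The limit vector is proportional to $S_\mu^{-1}\mu_*$, which is the direction that appears in Lemma 4.4(i).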


4.5 The Estimated Sample Moment 


We now consider the large sample behaviour of the estimated sample moment. 
In contrast to the results derived for the estimator, this analysis is uncomplicated 
and independent of the weighting matrix. 

The analysis rests in part on an application of the Weak Law of Large Num- 
bers. This law has not yet been invoked in our discussion of the nonlinear 
dynamic model, and so we now state it formally.²⁶


Lemma 4.5 Weak Law of Large Numbers
Let $\theta \in \Theta$, $E[f(v_t,\theta)] = \mu(\theta)$ and Assumptions 3.1, 3.2, 3.8 and 3.10 hold. Then


$$T^{-1}\sum_{t=1}^{T} f(v_t,\theta) \xrightarrow{p} \mu(\theta).$$


Let $\hat\theta_T$ be a GMM estimator and assume it converges to some point in
the parameter space, $\theta_*$. Notice that this definition is sufficiently broad to
include all the choices of weighting matrix considered above. In this case, it is 
straightforward to establish the following result. 


Theorem 4.3 Large Sample Behaviour of the Estimated Sample Moment

Let (i) Assumptions 3.1, 3.2, 3.8-3.10, 4.1 and 4.3 hold; (ii) $\hat\theta_T \xrightarrow{p} \theta_*$ for some
$\theta_* \in \Theta$. Then $g_T(\hat\theta_T) \xrightarrow{p} \mu(\theta_*)$ where $\|\mu(\theta_*)\| > 0$.


Proof:
Using the Mean Value Theorem, it follows that


$$g_T(\hat\theta_T) = g_T(\theta_*) + G_T(\hat\theta_T, \theta_*, \lambda_T)(\hat\theta_T - \theta_*)$$


24 See Hall and Inoue (2003) [Theorem 4]. 

25 It is for this reason that we do not characterize the nature of the limiting behaviour
beyond the given order in probability statement. 

26 See Wooldridge (1994) for discussion of Laws of Large Numbers in dynamic models. 
The result then follows directly because under the stated conditions
$G_T(\hat\theta_T, \theta_*, \lambda_T) \xrightarrow{p} G_* = O(1)$, $\hat\theta_T - \theta_* \xrightarrow{p} 0$, $g_T(\theta_*) \xrightarrow{p} \mu(\theta_*)$ and $\|\mu(\theta)\| > 0$ for
all $\theta \in \Theta$. ◊


The important consequence of Theorem 4.3 is that $\|T^{1/2}g_T(\hat\theta_T)\|$ diverges
to infinity at rate $T^{1/2}$. Therefore, taken together, Theorems 3.3 and 4.3 imply
that $T^{1/2}g_T(\hat\theta_T)$ converges to a mean zero normal distribution if the model is
correctly specified but diverges to infinity if the model is misspecified. This
property is exploited in the construction of the model specification tests which 
are reviewed in the next chapter. 
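
To make Theorem 4.3 concrete, here is a minimal simulation sketch in Python. It uses a deliberately simple, hypothetical misspecified model that does not come from the text: two moment conditions $E[x_t - \theta] = 0$ and $E[y_t - \theta] = 0$ are imposed although $x_t$ and $y_t$ have different means, and the weighting matrix is taken to be the identity so that the estimator has a closed form.

```python
import numpy as np

def estimated_sample_moment(T, seed=0):
    """Return (theta_hat, g_T(theta_hat)) for the toy misspecified model."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(T)          # E[x_t] = 0
    y = 1.0 + rng.standard_normal(T)    # E[y_t] = 1, so no single theta fits both moments
    # With W_T = I, minimising g_T(theta)' g_T(theta) gives the average of the two sample means
    theta_hat = 0.5 * (x.mean() + y.mean())
    g = np.array([x.mean() - theta_hat, y.mean() - theta_hat])
    return theta_hat, g

for T in (100, 10_000, 1_000_000):
    theta_hat, g = estimated_sample_moment(T)
    # g settles near a non-zero vector, so sqrt(T) * ||g|| keeps growing with T
    print(T, theta_hat, g, np.sqrt(T) * np.linalg.norm(g))
```

In this toy design the estimator converges to $\theta_* = 0.5$ and the estimated sample moment converges to $\mu(\theta_*) = (-0.5, 0.5)'$, so the scaled norm grows in proportion to $T^{1/2}$.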


4.6 Summary of Consequences of 
Misspecification for GMM Estimation 


It is useful to begin by recalling the properties of the GMM estimator in cor- 
rectly specified models. Since Assumptions 4.1 and 3.1 imply q > p, we confine 
our attention here to the case in which the parameter vector is overidentified. 


Properties of GMM in correctly specified models: 


• $\hat\theta_T$ converges in probability to $\theta_0$ for any choice of $W_T$ which satisfies
Assumption 3.7.


• $T^{1/2}(\hat\theta_T - \theta_0)$ converges to a normal distribution and the choice of weighting
matrix only affects this distribution in the variance via $W$.


• The two step and iterated estimators have the same asymptotic properties.


• $T^{1/2}g_T(\hat\theta_T)$ converges to a mean zero normal distribution.


In contrast, it has been shown in this chapter that the following properties hold 
in misspecified models. 


Properties of GMM in misspecified models: 


• The probability limit of $\hat\theta_T$ depends on $W$ in general.


• The rate of convergence of $\hat\theta_T$ to its limit, $\theta_*$, depends on the rate of
convergence of $W_T$ to $W$, and the limiting distribution of $c_T(\hat\theta_T - \theta_*)$
depends on that of $c_T(W_T - W)$.


• The two step and iterated estimators have different asymptotic properties,
and the asymptotic distribution of $\hat\theta_T(i)$ depends on the estimators from
the previous steps.²⁷


27 This statement excludes the case in which Wr = ee 
• $T^{1/2}g_T(\hat\theta_T)$ diverges.


So, basically, everything is different. Most importantly, misspecification 
means that in most cases we have not estimated what we anticipated. This is 
sufficient by itself to make all subsequent inferences misleading, and so provides 
a motivation for the model specification tests described in the next chapter. 


5 


Hypothesis Testing 


The previous two chapters describe the behaviour of the estimator and its asso- 
ciated statistics in both correctly specified and misspecified models. The next 
step is to develop inference procedures through which the estimation results can 
be used to learn about the underlying model. There are three broad questions 
which naturally arise in this context — Is the model correctly specified? Does the 
model satisfy restrictions implied by economic/statistical theory? Which of two 
competing models is correct? Within the GMM framework, all these questions 
are addressed via hypothesis tests concerning either population moment condi- 
tions or the parameter vector or both. In practice, these inferences are most 
often — if not always — based on the two step or iterated estimator. Therefore, 
we focus attention exclusively on this case throughout the chapter. 
Misspecification has the potential to make the estimator inconsistent, and 
so to render all subsequent inferences misleading. Therefore, it is prudent to 
begin by testing whether the model is correctly specified. Within our framework,
the economic/statistical model implies that $v_t$ satisfies the population
moment condition $E[f(v_t,\theta_0)] = 0$. Since this is the starting point for our
estimation, it is clearly desirable to test whether the data are consistent with
the hypothesis that this condition holds in the population. In most of the
applications in Table 1.1, q is greater than p and so the overidentifying restrictions
are available to form the basis for a test of the model specification. Section 5.1 
extends the earlier discussion of the overidentifying restrictions test to nonlinear 
dynamic models. It also presents a formal analysis of the statistic’s behaviour 
in both correctly and misspecified models. The latter involves two forms of 
misspecification: “non-local” and “local”. It is most common in the literature
to analyze the power properties of various statistics using a local misspecification
framework. This approach is particularly attractive in cases where more than
one statistic is available to test a hypothesis because it facilitates a meaningful
comparison of the candidates’ power properties. However, we include both
here because it is only via a non-local analysis that it becomes possible to un- 
cover the dependence of the limiting behaviour of the statistic on the method 
of covariance matrix estimation. This issue is only illustrated explicitly for the 
overidentifying restrictions test but equally applies to the other tests of model
specification described below. 

In some cases, a priori information may indicate that the potential misspec- 
ification is confined to certain elements of the population moment condition. 
In certain circumstances, it is possible to exploit this information to construct 
a more powerful test than the overidentifying restrictions test. Section 5.2 de- 
scribes when this is possible and presents statistics for testing so-called hypothe- 
ses about a subset of the moment conditions. If the model is validated by the 
previous test statistics, then it is reasonable to use the estimation results as 
a basis for inference about the phenomena captured by the model. In many 
economic models, these inferences reduce to hypotheses about restrictions on 
the parameter vector. Section 5.3 discusses methods for testing the hypothesis 
that the parameter vector satisfies a set of nonlinear restrictions of the form 
$r(\theta_0) = 0$. These types of restrictions naturally arise in many economic
models and so test results can often provide useful insights about the underlying
economic structure. 

One of the main assumptions behind GMM is that the population moment 
condition holds throughout the entire sample; in other words the model is as- 
sumed to be “structurally stable”. A natural concern is whether the popula- 
tion moment condition is only true for part of the sample in which case the 
model exhibits “structural instability”. Section 5.4 describes various methods 
for testing structural stability. The differences between the tests are most eas- 
ily understood by considering their sensitivity to instability of identifying and 
overidentifying restrictions separately. It is also shown how this decomposition 
can be exploited to develop tests which can distinguish between instability in 
the parameters alone and instability of a more general form. 

The foregoing hypothesis tests are by far the most common in the types of 
applications in Table 1.1, and so merit detailed discussion. Section 5.5 provides 
a brief summary of certain other inference techniques which have been proposed 
in the literature. Section 5.5.1 discusses non-nested hypothesis tests, which have 
been proposed as a method of choosing between two competing specifications. 
In some cases, one competing model can be nested within the other and so it 
is possible to assess which is more appropriate using the types of procedure 
described in Sections 5.1 through 5.3. However, in other cases the competing 
models are not nested in this fashion, and so alternative procedures must be 
developed. As will be seen, this type of question is much harder to address 
within the majority of models listed in Table 1.1 without further restrictions. 
Section 5.5.2 describes so-called “Hausman” tests which involve the compari- 
son of two estimators based on different sets of population moment conditions. 
Section 5.5.3 concludes the chapter with a discussion of “conditional moment” 
tests. These tests are commonly employed in models estimated by Maximum 
Likelihood to assess whether the assumed distribution is correct. Although 
Maximum Likelihood is not a focus of this book, these tests are included here 
because they have some important similarities and differences with the other 
procedures discussed above. Section 5.6 concludes with a brief summary of the 
chapter. 
Finally, two omissions should be noted. First, this chapter focuses exclusively
on the asymptotic properties of these tests. In most cases, the original
articles did not provide simulation evidence on the finite sample properties of 
their proposed tests. Instead this type of evidence tends to be found in studies 
which sought to examine the finite sample behaviour of all aspects of GMM 
in the context of a particular model. We believe that it is more instructive to 
review these studies in a similar spirit, and so further discussion of this aspect 
of hypothesis testing can be found in Chapter 6. Secondly, it is beyond the 
scope of this book to provide an introduction to the general theory of statistical 
hypothesis testing; this material can be found in many other sources such as 
Lehmann (1959) or Cox and Hinkley (1974).


5.1 The Overidentifying Restrictions Test 


Section 2.5 introduced the idea of using the overidentifying restrictions to test 
whether the model is correctly specified. Although this earlier discussion is 
in the context of the linear model, the underlying intuition is not specific to 
this structure. In this section we extend the overidentifying restrictions test 
to nonlinear dynamic models and formally analyse its properties in correctly 
specified and misspecified models. There are two main approaches to this type 
of analysis in misspecified models. The first employs the framework in Chapter 
4, which it is now useful to refer to as non-local misspecification. The second 
is based on a local form of misspecification. The distinction between them is 
best motivated by briefly reconsidering the nature of Assumption 4.1. This 
assumption has two important implications. First, there is no value of $\theta$ for
which $E[f(v_t,\theta)] = 0$; that is, the model is misspecified. Secondly,
$E[f(v_t,\theta)] = \mu(\theta)$; that is, the “size” of the misspecification, $\mu(\theta)$, is the same
for all t, regardless of the sample size. In other words, the model is wrong and the
situation does not change as the sample size increases. This scenario contrasts
with local misspecification in which the model is misspecified for finite T, but
the size of the misspecification decreases with T so that in the limit the model is
correct. This misspecification is “local” in the sense that the data are generated
by a sequence of processes which become closer and closer to satisfying $H_0$ as
T increases and in the limit do satisfy this hypothesis. As might be imagined, 
a different analysis is required for each type of misspecification. Therefore, we 
break our discussion down into three parts. Section 5.1.1 introduces the test 
statistic and derives its asymptotic distribution in correctly specified models. 
Section 5.1.2 considers the behaviour of the statistic in non-locally misspecified 
models, and Section 5.1.3 presents its local counterpart. As will be seen, the 
conclusions from these two types of analysis are couched in very different terms. 
Section 5.1.4 concludes the discussion with a demonstration that each form of 
analysis leads to the same qualitative conclusions about the properties of the 
test. 
5.1.1 The Statistic and its Asymptotic Distribution in
Correctly Specified Models 


Section 2.5 introduces the idea of using the overidentifying restrictions test
statistic to assess the adequacy of the model specification. It can be recalled
that the idea behind the test is simple: if $E[z_t u_t(\theta_0)] = 0$ then the estimated
sample moment, $T^{-1}Z'u(\hat\theta_T)$, should be zero once allowance is made for sampling
error. The same logic can be applied equally in nonlinear dynamic models:
if $E[f(v_t,\theta_0)] = 0$ then $g_T(\hat\theta_T)$ should be approximately zero. This insight
motivated Hansen (1982) to propose testing the null hypothesis


$$H_0: E[f(v_t,\theta_0)] = 0 \tag{5.1}$$

using the overidentifying restrictions test statistic

$$J_T = T\,g_T(\hat\theta_T)'\,\hat S_T^{-1}\,g_T(\hat\theta_T) \tag{5.2}$$


where, as a reminder, $\hat\theta_T$ is the second step (or iterated) estimator. This statistic
is easily recognized to be the generalization of Sargan’s (1958) statistic (equa- 
tion (2.42) above) to nonlinear dynamic models. Hansen (1982, Lemma 4.2) 
derived its limiting distribution under Ho, and this result is given in the follow- 
ing theorem. 


Theorem 5.1 The Asymptotic Distribution of the Overidentifying Restrictions
Test Statistic
If (i) Assumptions 3.1-3.5, 3.8-3.13 hold; (ii) $\hat S_T$ is positive semi-definite and
converges in probability to $S$; then $J_T \xrightarrow{d} \chi^2_{q-p}$.


Proof:
Since $\hat S_T \xrightarrow{p} S$ it follows from Slutsky’s Theorem (Lemma 1.1) that
$J_T - \tilde J_T \xrightarrow{p} 0$ where

$$\tilde J_T = T\,g_T(\hat\theta_T)'\,S^{-1}\,g_T(\hat\theta_T) \tag{5.3}$$

Therefore, the theorem can be established by proving that $\tilde J_T$ has the stated
limiting distribution. Using Theorem 3.3 evaluated at $W = S^{-1}$, we obtain


$$\tilde J_T \xrightarrow{d} \|[I_q - P(\theta_0)]n_q\|^2 = n_q'[I_q - P(\theta_0)]n_q \tag{5.4}$$


where $n_q$ denotes a (q × 1) random vector with a standard normal distribution.
Now $I_q - P(\theta_0)$ is a projection matrix whose rank is q − p by Assumption 3.4.¹
The desired result then follows from (5.4) and Rao (1973, p.186). ◊
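
The last step of the proof, that a standard normal vector passed through a rank q − p projection gives a $\chi^2_{q-p}$ quadratic form, can be checked by simulation. The Python sketch below is only an illustration: the matrix `F` is a randomly generated stand-in for $F(\theta_0)$, and the dimensions q = 5, p = 2 are arbitrary choices. It compares the simulated mean and variance of $n_q'[I_q - P]n_q$ with the values q − p and 2(q − p) implied by the chi-squared distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
q, p = 5, 2
F = rng.standard_normal((q, p))                 # hypothetical stand-in for F(theta_0)
P = F @ np.linalg.solve(F.T @ F, F.T)           # projection onto the column space of F
M = np.eye(q) - P                               # I_q - P, a projection of rank q - p

n = rng.standard_normal((200_000, q))           # draws of the standard normal vector n_q
quad = np.einsum("ij,jk,ik->i", n, M, n)        # n' (I_q - P) n for each draw

print("rank of I - P:", round(np.trace(M)))     # trace of a projection equals its rank
print("mean:", quad.mean(), "variance:", quad.var())
```

The simulated moments should be close to 3 and 6 respectively in this design, matching a $\chi^2_3$ distribution.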


Notice that Theorem 5.1 holds for any choice of covariance matrix estimator
which is both positive semi-definite and consistent for S under the assumption
that the model is correctly specified. This class includes any estimators in Sections
3.5 and 4.3 which adequately capture the dynamic structure of $f(v_t,\theta_0)$.


1 Recall that Assumption 3.4 implies Assumption 3.6 and hence that rank$[F(\theta_0)] = p$.




Although we have stated Theorem 5.1 in terms of the two step or iterated
GMM estimator, intuition suggests a similar result holds for the minimand of the
continuous updating estimator.² In fact, the proof of Theorem 5.1 is easily
adapted to show that under $H_0$


$$J_{cont,T} = T\,Q_{cont,T}(\hat\theta_T) \xrightarrow{d} \chi^2_{q-p} \tag{5.5}$$


where, with an abuse of notation, $\hat\theta_T$ is now the continuous updating
estimator.³ However, while the asymptotic distributions of $J_{cont,T}$ and $J_T$ are the
same, the numerical values differ in a predictable way under certain circumstances.
Specifically, if $J_T$ is based on the iterated estimator and these iterations
converge, then it follows from the definition of the continuous updating estimator
that $J_{cont,T}$ cannot exceed $J_T$.⁴

This statistic has become a standard diagnostic for models estimated by 
GMM and is routinely calculated in most computer packages. In Section 2.5, we 
discussed the interpretation of this test in general terms. We now complement 
those earlier remarks with a more formal analysis of the statistic’s behaviour in 
misspecified models in Sections 5.1.2 and 5.1.3. 
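
As a concrete illustration of how the statistic in (5.2) is computed, the following Python sketch implements two-step GMM for a linear instrumental variables model and evaluates $J_T$. Everything in it is an assumption made for illustration rather than the text's own procedure: the function names `two_step_gmm_j` and `simulate`, the simulated design with three instruments and one regressor (so q − p = 2), and the use of a centred covariance estimator suitable for serially uncorrelated moments.

```python
import numpy as np

def two_step_gmm_j(y, X, Z):
    """Two-step GMM for y = X theta + u with instruments Z; returns (theta_hat, J_T)."""
    T = len(y)

    def gmm(W):
        ZX, Zy = Z.T @ X / T, Z.T @ y / T
        return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)

    theta1 = gmm(np.linalg.inv(Z.T @ Z / T))    # first step weighting matrix
    f = Z * (y - X @ theta1)[:, None]           # f(v_t, theta1) = z_t * u_t(theta1)
    f = f - f.mean(axis=0)                      # centre before estimating the variance
    W2 = np.linalg.inv(f.T @ f / T)             # inverse of the covariance estimate
    theta2 = gmm(W2)                            # second step estimator
    g = Z.T @ (y - X @ theta2) / T              # estimated sample moment g_T(theta2)
    return theta2, T * g @ W2 @ g

def simulate(T, direct_effect, seed=0):
    """Hypothetical design: the first instrument is invalid when direct_effect != 0."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((T, 3))
    v = rng.standard_normal(T)
    u = 0.5 * v + rng.standard_normal(T)        # regressor is endogenous
    x = Z.sum(axis=1) + v
    y = 2.0 * x + direct_effect * Z[:, 0] + u
    return y, x[:, None], Z

theta_ok, J_ok = two_step_gmm_j(*simulate(2000, 0.0))
theta_bad, J_bad = two_step_gmm_j(*simulate(2000, 0.5))
# Compare each J with 5.99, the 95% point of the chi-squared(2) distribution
print(theta_ok, J_ok)
print(theta_bad, J_bad)
```

With valid instruments the statistic behaves like a draw from $\chi^2_2$, whereas with the invalid instrument it is far above the critical value, in line with the behaviour analysed in Sections 5.1.2 and 5.1.3.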


5.1.2 Non-Local Misspecification 


Our analysis of GMM in misspecified models is premised on Assumption 4.1.⁵
As mentioned above, this misspecification is referred to as “non-local” because
the “size” of the misspecification, $\mu(\theta)$, is the same for all observations and
sample sizes. Intuition suggests that if the model is wrong for every observation
then the evidence against it must mount up as the sample increases with the
result that the model is rejected with probability one in the limit. In essence this
intuition is correct, but there is an important caveat concerning the calculation
of the covariance matrix. The analysis in this section is based on Hall (2000).
Before we present the more formal analysis, it is useful to develop a heuristic
understanding of the way in which the covariance matrix estimator can play such
a crucial role. Recall that the overidentifying restrictions test is a quadratic form
in $T^{1/2}g_T(\hat\theta_T)$ and $\hat S_T^{-1}$. Theorem 4.3 indicates that $T^{1/2}g_T(\hat\theta_T)$ diverges under
non-local misspecification. Intuition suggests that this behaviour is inherited by $J_T$
provided $\hat S_T^{-1}$ converges in probability to a positive definite matrix. However,
it can be recalled from Section 4.4 that the inverses of certain covariance matrix
estimators, $\hat S_{HAC}$ in particular, only converge to a positive semi-definite
matrix in misspecified models and in these cases it is no longer so obvious that
$J_T$ diverges. For this reason, it is most convenient to separate our analysis into
two parts depending on the limiting behaviour of $\hat S_T^{-1}$. There is one other aspect
of this heuristic discussion, which should be noted. We have made no mention


2 See Section 3.7. 

3 We omit the details for brevity. See Hansen, Heaton, and Yaron (1996) for further 
discussion. 

4 See Section 6.3 for further discussion. 

5 See Chapter 4. 
of whether $\hat S_T$ is a consistent estimator of $S_\mu$. The key issue is only whether or
not $\mathrm{plim}_{T\to\infty}\hat S_T^{-1}$ is positive definite, for which consistency is sufficient but not
necessary.

We begin with the more standard case in which $\hat S_T^{-1}$ converges to a positive
definite limit. Inspection of Section 4.3 reveals that this case covers the estimators
$\hat S_{SU}$ and the versions of the covariance matrices based on $f(v_t,\hat\theta_T) - g_T(\hat\theta_T)$.⁶
For this analysis, we require the second step estimator to converge in probability
to some constant limit. Below we impose this condition directly for simplicity
because more primitive conditions depend in part on the covariance matrix
estimator; see Chapter 4.⁷


Theorem 5.2 Large Sample Behaviour of $J_T$: Part (i)

If (i) Assumptions 3.1, 3.2, 3.8-3.10, 4.1 and 4.3 hold; (ii) $\hat S_T^{-1}$ satisfies
Assumption 3.7; (iii) $\hat\theta_T \xrightarrow{p} \theta_*$ for some $\theta_* \in \Theta$; then: $T^{-1}J_T \xrightarrow{p} c$ where
$0 < c < \infty$ and so $\lim_{T\to\infty} P[J_T > c_\alpha] = 1$, where $c_\alpha$ is the $100(1-\alpha)^{th}$
percentile of the $\chi^2_{q-p}$ distribution.


The basic outline of the proof has been anticipated above, but for completeness 
we now fill in the details. 

Proof:

Let $W$ denote the probability limit of $\hat S_T^{-1}$ and $\mu_* = E[f(v_t,\theta_*)]$. From Theorem
4.3 and Slutsky’s Theorem (Lemma 1.1) it follows that


$$T^{-1}J_T = \mu_*'W\mu_* + o_p(1) \tag{5.6}$$


Since $W$ is positive definite and $\mu_* \neq 0$ by Assumption 4.1, it follows from (5.6)
that $T^{-1}J_T \xrightarrow{p} c = \mu_*'W\mu_* > 0$. Therefore, $J_T = Tc + o_p(T)$ increases at rate
$T$ and so tends to $\infty$ in probability as $T \to \infty$, which gives the desired result.
◊


In statistical parlance, Theorem 5.2 states that $J_T$ is a consistent test of
$H_0: E[f(v_t,\theta_0)] = 0$ against the alternative that the data satisfy Assumption
4.1.⁸

We now consider what happens if $\hat S_{HAC}^{-1}$ is used as the weighting matrix on
the second step. It can be recalled from Lemma 4.3 that $\hat S_{HAC}^{-1}$ converges in
probability to a positive semi-definite matrix and that the form of this limit
has important implications for the two step estimator. We now establish that 
this limiting behaviour also has important consequences for the behaviour of 
the overidentifying restrictions test. 


6 $\hat S_{VARMA}$ is omitted from this list because, at time of writing, its limiting behaviour in
misspecified models is unknown; see the discussion in Section 4.3.

7 For the purposes of comparison with Chapter 4, note that here we suppress the (2) index
on both $\hat\theta_T$ and $\theta_*$ for ease of notation.

8 This is a somewhat unfortunate terminology since we have already used the term con- 
sistency to refer to a property of an estimator. However, the meaning should be obvious from 
the context. 
Theorem 5.3 Large Sample Behaviour of $J_T$: Part (ii)

If: (i) Assumptions 3.1, 3.2, 3.8-3.10, 4.1, 4.3 and 4.4 hold; (ii) $\hat S_T = \hat S_{HAC} =
S_\mu + B_T\mu_*\mu_*' + o_p(1)$ where $B_T = 1 + 2\sum_{i=1}^{T-1}\omega_{i,T}$, $B_T = O(b_T)$, $\lim_{T\to\infty} B_T/b_T \neq 0$
and the bandwidth satisfies $b_T \to \infty$, $b_T = o(T^{1/2})$; then: $J_T = O_p(T/b_T)$.
Proof:

Let the minimand on the second step GMM estimation be $Q_T(\theta)$. By definition
$Q_T(\hat\theta_T(2)) \le Q_T(\theta_*)$, and so it is sufficient to prove that $TQ_T(\theta_*) = O_p(T/b_T)$.
By the Cauchy-Schwarz inequality⁹ and condition (ii), we have


$$TQ_T(\theta_*) \le |T/b_T|\,|b_T/B_T|\,|B_TQ_T(\theta_*)| \tag{5.7}$$

Since $T/b_T = O(T/b_T)$ and condition (ii) implies $b_T/B_T = O(1)$ we concentrate
here on showing that $B_TQ_T(\theta_*) = O_p(1)$. Since

$$B_TQ_T(\theta_*) = B_T^{1/2}g_T(\theta_*)'\,\hat S_T^{-1}\,B_T^{1/2}g_T(\theta_*)$$

we first consider $B_T^{1/2}g_T(\theta_*)$. By definition, we have


$$B_T^{1/2}g_T(\theta_*) = B_T^{1/2}\mu_* + (B_T/T)^{1/2}\,T^{-1/2}\sum_{t=1}^{T}[f(v_t,\theta_*) - \mu_*] \tag{5.8}$$


Now Lemma 4.1 implies that $T^{-1/2}\sum_{t=1}^{T}[f(v_t,\theta_*) - \mu_*] = O_p(1)$. Furthermore
we have assumed that $b_T = o(T^{1/2})$, and so (5.8) implies
$B_T^{1/2}g_T(\theta_*) = B_T^{1/2}\mu_* + o_p(1)$. Therefore, it follows that

$$B_TQ_T(\theta_*) = B_T\,\mu_*'\hat S_T^{-1}\mu_* + o_p(1) \tag{5.9}$$

$$= B_T\,\mu_*'\bar S_T^{-1}\mu_* + B_T\,\mu_*'(\hat S_T^{-1} - \bar S_T^{-1})\mu_* + o_p(1) \tag{5.10}$$


Using (4.53) it can be shown that


$$B_T\,\mu_*'\bar S_T^{-1}\mu_* = \frac{B_T\,\mu_*'S_\mu^{-1}\mu_*}{1 + B_T\,\mu_*'S_\mu^{-1}\mu_*} = O(1) \tag{5.11}$$


Now consider the second term in (5.10), that is

$$B_T\,\mu_*'(\hat S_T^{-1} - \bar S_T^{-1})\mu_* = B_T\,\mu_*'\hat S_T^{-1}(\bar S_T - \hat S_T)\bar S_T^{-1}\mu_* = \eta_{2,T}, \text{ say}$$


From (4.54)-(4.55), it follows that $\eta_{2,T} = o_p(1)$, and so, using this result with
(5.11) in (5.10), we have $B_TQ_T(\theta_*) = O_p(1)$. The desired result then follows
from (5.7). ◊


Theorem 5.3 indicates that $J_T$ cannot increase at a faster rate than $T/b_T$
when $W_T = \hat S_{HAC}^{-1}$. By itself, this result does not imply $J_T$ increases at that


9 See Apostol (1974) [p.294]. 
rate, although this is in fact the case. Therefore, the overidentifying restrictions
test is still consistent. 

Together Theorems 5.2 and 5.3 indicate there is a difference in the rate
at which $J_T$ diverges depending on how the HAC is calculated. If a centred
HAC estimator is used then $J_T$ increases at rate $T$, but if an uncentred HAC
estimator is used then $J_T$ increases at rate $T/b_T$. Notice if we use an uncentred
HAC with the optimal bandwidth then there is also a difference in the rate of
increase of $J_T$ between the kernels.¹⁰ With the Bartlett, $J_T$ increases at rate
$T^{2/3}$, whereas with the Parzen and Quadratic Spectral kernels, $J_T$ increases at
rate $T^{4/5}$. Hall (2000) provides simulation evidence which illustrates that the
failure to centre the HAC can have a substantial impact on the magnitude of 
the statistic in finite samples as well. It is less clear whether this difference 
in rates also manifests itself in differing power properties for the two versions 
of the test. For power calculations at a fixed significance level, it is only the 
magnitude of the statistic relative to the critical point which matters. Intuition 
suggests that there may be circumstances in which the two versions of the tests 
have different finite sample power properties but this remains an open research 
question. However, the rate of increase is important for the construction of 
moment selection procedures based on the overidentifying restrictions test; see 
Section 7.3.1. 
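
The effect of centring can be seen in a small Python experiment. The sketch below is a simplified illustration under stated assumptions, not a reproduction of the simulations in Hall (2000): it evaluates the quadratic form $T\,g_T'\hat S^{-1}g_T$ at a fixed parameter value (the quantity bounded in the proof of Theorem 5.3) rather than at an estimated one, uses i.i.d. data with a hypothetical non-zero mean vector `mu`, and takes a Bartlett bandwidth proportional to $T^{1/3}$. The function names are invented for this example.

```python
import numpy as np

def bartlett_hac(f, b, centre):
    """Bartlett-kernel HAC estimate from a (T x q) array of moment functions."""
    T = f.shape[0]
    if centre:
        f = f - f.mean(axis=0)              # remove the sample mean first
    S = f.T @ f / T
    for j in range(1, b + 1):
        w = 1.0 - j / (b + 1.0)             # Bartlett weight at lag j
        gamma_j = f[j:].T @ f[:-j] / T      # lag-j sample autocovariance
        S += w * (gamma_j + gamma_j.T)
    return S

def scaled_quadratic_form(T, centre, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.array([0.5, 0.5])               # hypothetical non-zero mean mu_*
    f = mu + rng.standard_normal((T, 2))    # stand-in for f(v_t, theta_*)
    b = int(round(T ** (1.0 / 3.0)))        # bandwidth growing like T^(1/3)
    g = f.mean(axis=0)
    S_hat = bartlett_hac(f, b, centre)
    return T * g @ np.linalg.solve(S_hat, g)

for T in (500, 4000, 32000):
    centred = scaled_quadratic_form(T, centre=True)
    uncentred = scaled_quadratic_form(T, centre=False)
    print(T, round(centred, 1), round(uncentred, 1))
```

In this design the centred version grows roughly in proportion to $T$, while the uncentred version grows much more slowly because the estimated covariance matrix absorbs a term of order $B_T\mu_*\mu_*'$.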


5.1.3 Local Misspecification 


So far our analysis has considered the scenarios in which the model is either 
correctly specified or subject to non-local misspecification. The contrast be- 
tween these two is stark. If the model is correct then the following holds: (a) 
the population moment condition is true for all t; (b) the parameter estimator 
is consistent; (c) $T^{1/2}g_T(\hat\theta_T)$ converges to a mean zero normal distribution; and
(d) it is only necessary to capture the dynamic structure of $f_t$ to construct a
consistent estimator of the long run variance. In contrast if there is non-local
misspecification then: (a) the population moment condition is invalid for all t;
(b) the parameter estimator is likely to be inconsistent; (c) $T^{1/2}g_T(\hat\theta_T)$ diverges;
and (d) the construction of a consistent covariance matrix estimator must
account for both the non-zero mean and the dynamic structure of $f_t$. In this
section, we move to a third scenario which lies between these two extremes.
Local misspecification captures the case where the population moment condition is
invalid for any finite T but the size of the violation is $O(T^{-1/2})$ and so disappears
in the limit. This rate of decrease ensures the misspecification does not affect
the probability limits of either the parameter or covariance matrix estimators,
but does manifest itself in the mean of the limiting distribution of $T^{1/2}g_T(\theta_0)$
and consequently the asymptotic distributions of the estimator and estimated 
sample moment as well. Newey (1985a) was the first paper to present an analy- 
sis of the overidentifying restrictions test under local alternatives. However, we 
take a different approach to the construction of local misspecification which was 


10 See Table 3.4. 
first exploited in this context by Hall (1999). The qualitative conclusions are
the same as Newey’s (1985a) but the route to them is slightly different. 

To introduce the local misspecification framework, it is most convenient to 
begin with the transformed population moment condition introduced in Section 
3.3. Since two-step estimation involves $W = S^{-1}$ on the final step, we express
the hypothesis that Assumption 3.3 holds by


$$H_0: S^{-1/2}E[f(v_t,\theta_0)] = 0 \tag{5.12}$$


The advantage of this approach is that it allows $H_0$ to be decomposed into
hypotheses about the identifying and overidentifying restrictions. To this end, we
once again set $P(\theta) = F(\theta)[F(\theta)'F(\theta)]^{-1}F(\theta)'$ where
$F(\theta) = S^{-1/2}E[\partial f(v_t,\theta)/\partial\theta']$. It then follows from (3.19)-(3.20) that


$$H_0: H_0^I \ \&\ H_0^O$$
$$H_0^I: P(\theta_0)S^{-1/2}E[f(v_t,\theta_0)] = 0$$
$$H_0^O: [I_q - P(\theta_0)]S^{-1/2}E[f(v_t,\theta_0)] = 0$$


where $H_0^I$, $H_0^O$ are respectively the hypotheses that the identifying and
overidentifying restrictions hold at $\theta_0$. Since the transformed population moment
can always be decomposed into


$$S^{-1/2}E[f(v_t,\theta)] = P(\theta)S^{-1/2}E[f(v_t,\theta)] + [I_q - P(\theta)]S^{-1/2}E[f(v_t,\theta)] \tag{5.13}$$

we can characterize the local misspecification in terms of violations of the
identifying and overidentifying restrictions. To this end, we introduce the following
sequences of local alternatives to $H_0^I$ and $H_0^O$


$$H_{A,T}^I: P(\theta_0)S^{-1/2}E_T[f(v_t,\theta_0)] = T^{-1/2}P(\theta_0)\eta_I = T^{-1/2}\mu_I$$
$$H_{A,T}^O: [I_q - P(\theta_0)]S^{-1/2}E_T[f(v_t,\theta_0)] = T^{-1/2}[I_q - P(\theta_0)]\eta_O = T^{-1/2}\mu_O$$


in which $\mu_I \neq 0$, $\mu_O \neq 0$ and $E_T[\cdot]$ denotes expectations with respect to the joint
probability distribution of $\{v_t; t = 1,2,\dots,T\}$. The reason for this subscript
on the expectation operator is discussed below, but first we briefly consider
the nature of these two alternatives. Notice that under $H_{A,T}^I$, the identifying
restrictions are violated for finite T, but the “size” of this violation decreases
as T increases and disappears in the limit as $T \to \infty$. Clearly, $H_{A,T}^O$ implies a
similar pattern of violations of the overidentifying restrictions. This technical
device for constructing local alternative hypotheses is known as Pitman drift
after Pitman (1949) who first introduced it.¹¹ As mentioned above, equation
(5.13) can be used to combine these two sequences into a sequence of local
alternatives to $H_0$, that is $H_{A,T} = H_{A,T}^I \ \&\ H_{A,T}^O$.


11 Edwin Pitman (1897-1993) was an Australian statistician who made a number of con- 
tributions to statistics including the eponymous efficiency measure. The 1949 reference is to 
a set of lecture notes prepared for a lecture series given at the University of North Carolina, 
Chapel Hill and also elsewhere in the U.S. Although not published at that time, the notes 
were widely circulated and played an influential role in the development of statistical theory. 
One immediate consequence of local misspecification is that the data $v_t$
cannot be a realization from a strictly stationary process because the value of
$E[f(v_t,\theta_0)]$ changes with T. This means the probability distribution of the data
depends on T, and so the sample is now a realization from a doubly indexed
process $\{v_{t,T}; t = 1,\dots,T; T = 1,2,\dots\}$,¹² and it is for this reason that we
introduced above the subscript T for the expectation operator. It is possible to
develop characterizations of the probability distribution of the data which lead
to $H_{A,T}$; for example, see Newey (1985a). However we do not pursue that route
here and choose instead to characterize the data generation process implicitly
via the properties which play a part in the analysis. Intuition suggests that
$H_{A,T}$ causes a relatively modest perturbation from stationarity, and it is
reasonable to assume there are data generation processes which satisfy the following
assumption.


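Assumption 5.1 Data Generation Process under $H_{A,T}$

The observed data are assumed to be a realization from a stochastic process
$\{v_t; t = 1,2,\dots\}$ which satisfies the following conditions: (i) $\hat\theta_T \xrightarrow{p} \theta_0$; (ii)
$g_T(\hat\theta_T) \xrightarrow{p} 0$; (iii) $G_T(\hat\theta_T) \xrightarrow{p} G_0$, $G_T(\hat\theta_T, \theta_0, \lambda_T) \xrightarrow{p} G_0$; (iv) $\hat S_T \xrightarrow{p} S$, a positive
definite matrix; (v) $S^{-1/2}T^{1/2}g_T(\theta_0) \xrightarrow{d} N(\mu_I + \mu_O, I_q)$.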


So for our purposes, the only effective difference between the data generation
processes under $H_0$ and $H_{A,T}$ is in the mean of the limiting distribution for
$T^{1/2}g_T(\theta_0)$.

Before we analyze the behaviour of the overidentifying restrictions test, it
is instructive to consider the impact of local misspecification on the asymptotic
distribution of the parameter estimator. Since $\hat\theta_T \xrightarrow{p} \theta_0$, we can use (3.24)-(3.26)
in order to establish the following result.


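Lemma 5.1 The Asymptotic Behaviour of $T^{1/2}(\hat\theta_T - \theta_0)$ under $H_{A,T}$
If Assumption 5.1 holds then:


$$T^{1/2}(\hat\theta_T - \theta_0) \xrightarrow{d} N\left(-(G_0'S^{-1}G_0)^{-1}G_0'S^{-1/2}\mu_I,\ (G_0'S^{-1}G_0)^{-1}\right)$$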


There are two aspects of this distributional result which should be noted.
First, a comparison with Theorem 3.2 reveals that local misspecification only
impacts on the mean of the distribution. Secondly, this impact derives from
$H_{A,T}^I$ alone. This conforms to our earlier comments about the different roles of
these two sets of restrictions.¹³ A local violation of the identifying restrictions
causes a bias in the asymptotic distribution of $\hat\theta_T$ away from $\theta_0$, but a local
violation of the overidentifying restrictions has no impact.

With this in mind, we now characterize the behaviour of the overidentifying
restrictions test under $H_{A,T}$.


12 Such a process is called a triangular array; see Davidson (1994) [pp.34, 178]. However, 
for notational simplicity, we suppress the additional subscript on v. 
13 See Section 3.3. 
Theorem 5.4 Large Sample Behaviour of $J_T$ under $H_{A,T}$


If Assumption 5.1 holds then: $J_T \xrightarrow{d} \chi^2_{q-p}(\mu_O'\mu_O)$ where $\chi^2_a(b)$ denotes a $\chi^2$
distribution with degrees of freedom a and non-centrality parameter b.¹⁴


Proof:

Once again, it suffices to consider the statistic $\tilde J_T = T g_T(\hat\theta_T)'S^{-1}g_T(\hat\theta_T)$. The
first few steps of the argument are identical to the analysis of $T^{1/2}g_T(\hat\theta_T)$ in the
proof of Theorem 5.1. The Mean Value Theorem can be used to deduce (3.34)
and this in turn leads to (3.35). Since Assumption 5.1 implies the matrices
in (3.35) converge to the same limits under $H_0$ and $H_{A,T}$, equation (3.35) is
equivalent to

$$S^{-1/2}T^{1/2}g_T(\hat\theta_T) = [I_q - P(\theta_0)]S^{-1/2}T^{1/2}g_T(\theta_0) + o_p(1) \qquad (5.14)$$

Using (5.14) and Assumption 5.1, it follows that

$$\tilde J_T \xrightarrow{d} \|[I_q - P(\theta_0)](n_q + \mu_I + \mu_O)\|^2 = (n_q + \mu_I + \mu_O)'[I_q - P(\theta_0)](n_q + \mu_I + \mu_O) \qquad (5.15)$$

where $n_q \sim N(0, I_q)$. Equation (5.15) implies $J_T$ converges to a $\chi^2_{q-p}(b)$
distribution where $b = (\mu_I + \mu_O)'[I_q - P(\theta_0)](\mu_I + \mu_O)$. However, since $\mu_I = P(\theta_0)\eta_I$
and $\mu_O = [I_q - P(\theta_0)]\eta_O$, the non-centrality parameter reduces to $b = \mu_O'\mu_O$.
□


Theorem 5.4 reveals that the non-centrality parameter depends on $\mu_O$ alone,
and so the test only has power against local violations of the overidentifying
restrictions. This implies that if the local misspecification is confined to the
identifying restrictions then $J_T$ converges to a central $\chi^2_{q-p}$ distribution.
Therefore, the test has the same distribution under both $H_0$ and $H^I_{A,T}$ & $H^O_0$, and so
cannot be used to discriminate between these two states of the world.
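
The algebra behind Theorem 5.4 can be checked numerically. The following is a minimal sketch, assuming an arbitrary full column rank matrix in place of $F(\theta_0)$ and randomly drawn vectors $\eta_I$ and $\eta_O$; the dimensions, the seed and all variable names are illustrative choices rather than quantities taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
q, p = 5, 2  # illustrative numbers of moment conditions and parameters

# Hypothetical stand-in for F(theta_0): any q x p matrix with full column rank.
F = rng.standard_normal((q, p))
P = F @ np.linalg.solve(F.T @ F, F.T)  # projection onto the column space of F
M = np.eye(q) - P                      # projection onto the orthogonal complement

eta_I = rng.standard_normal(q)
eta_O = rng.standard_normal(q)
mu_I = P @ eta_I  # local violation of the identifying restrictions
mu_O = M @ eta_O  # local violation of the overidentifying restrictions


def noncentrality(mu):
    """Non-centrality parameter b = mu' [I_q - P] mu from the proof of Theorem 5.4."""
    return float(mu @ M @ mu)


b_both = noncentrality(mu_I + mu_O)  # both types of local violation present
b_ident_only = noncentrality(mu_I)   # only the identifying restrictions violated

print(b_both, float(mu_O @ mu_O), b_ident_only)
```

In this sketch `b_both` coincides with `mu_O @ mu_O` and `b_ident_only` is zero up to rounding error, which is the numerical counterpart of the statement that the test has no local power against violations of the identifying restrictions alone.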


5.1.4 The Parallels Between Non-Local and Local 
Analysis 


As we have just seen, very different techniques are required for the analysis of 
the test’s behaviour in the presence of local and non-local misspecification. At 
first glance, it is not immediately obvious that they lead to the same conclusions 
about the interpretation of a significant statistic — but they do! Since the test 
is the standard diagnostic within the GMM framework, it is worthwhile briefly 
explaining the parallels between the two types of analysis. 

In the preamble to Chapter 4, we introduced three models: the assumed
model $\mathcal{M}$, and two alternative candidates for the true model, $\mathcal{M}_A$ and $\mathcal{M}_B$.
These models have the following properties:

$\mathcal{M} \Rightarrow E[f(v_t,\theta_0)] = 0$ for some unique $\theta_0 \in \Theta$


14 See Johnson and Kotz (1970) [Chapter 28] for a review of the properties of the non-central
$\chi^2$ distribution.
$\mathcal{M}_A \Rightarrow E[f(v_t,\theta_+)] = 0$ for some unique $\theta_+ \in \Theta$
$\mathcal{M}_B \Rightarrow \nexists\, \theta \in \Theta$ such that $E[f(v_t,\theta)] = 0$


If $\mathcal{M}$ is misspecified then whether or not we can detect this fact using $J_T$
depends on whether $\mathcal{M}_A$ or $\mathcal{M}_B$ represents the truth. If $\mathcal{M}_B$ is true then the
assumed population moment condition is subject to non-local misspecification.
In this case, $J_T$ is consistent against this alternative, and so leads to rejection
of the model with probability one in the limit. Now suppose $\mathcal{M}_A$ represents the
truth. Thus far, we have not explicitly considered this case, but it is easy to see
what happens. Both $\mathcal{M}$ and $\mathcal{M}_A$ imply there is a unique value of $\theta$ at which
the population moment condition is satisfied. Since neither model places any
further restrictions on this unique value of $\theta$, they are observationally equivalent
on the basis of $E[f(v_t,\theta)]$. Therefore, the estimator and all its associated
statistics behave exactly the same under both $\mathcal{M}$ and $\mathcal{M}_A$. So this type of model
misspecification cannot be detected, for the good reason that the part of the
model used in estimation is actually correct!

This behaviour is mirrored in the analysis under local misspecification. The
local alternative $H^I_{A,T}$ & $H^O_0$ corresponds to a local version of $\mathcal{M}_A$. To bring
out this connection, it is necessary to consider a local alternative to $H_0$ in which
the population moment condition is satisfied at a sequence of parameter values
that converges to $\theta_0$, that is

$$H^{+}_{A,T}:\ S^{-1/2}E_T[f(v_t,\theta_T)] = 0 \qquad (5.16)$$

where $\theta_T = \theta_0 + T^{-1/2}\eta_\theta$. Given the local nature of the alternative, it is possible
to use a first order Taylor expansion in (5.16) to deduce that $H^{+}_{A,T}$ implies

$$S^{-1/2}E_T[f(v_t,\theta_0)] + T^{-1/2}F(\theta_0)\eta_\theta = 0 \qquad (5.17)$$

Since $F(\theta_0) = P(\theta_0)F(\theta_0)$, equation (5.17) can be rewritten as

$$S^{-1/2}E_T[f(v_t,\theta_0)] = -T^{-1/2}F(\theta_0)\eta_\theta = T^{-1/2}P(\theta_0)\eta_I$$

where $\eta_I = -F(\theta_0)\eta_\theta$. It is then immediately apparent that $H^{+}_{A,T}$ is equivalent to
$H^I_{A,T}$ & $H^O_0$. In other words, $H^I_{A,T}$ & $H^O_0$ can be characterized as a sequence of alternatives
in which the population moment condition is satisfied at a unique parameter
value for each $T$, and so each member of the sequence satisfies the definition
of $\mathcal{M}_A$. Theorem 5.4 states that $J_T$ has the same distribution under both $H_0$
and $H^I_{A,T}$ & $H^O_0$, and so can now be recognized as the precursor to our
comments above about the statistic's behaviour under $\mathcal{M}_A$. Since $H^{+}_{A,T}$ implies
that $S^{-1/2}E_T[f(v_t,\theta_0)]$ lies in the column space of $F(\theta_0)$, it follows that $H^O_{A,T}$
implies the data are generated by a sequence of models with the properties of
$\mathcal{M}_B$. So, Theorems 5.2 and 5.4 represent two ways of saying that the test can
be used to discriminate between $\mathcal{M}$ and $\mathcal{M}_B$.

To conclude this discussion, it is useful to bring one implicit assumption into
the light. Throughout, it has been assumed that the estimation really did locate
the global minimum of $Q_T(\theta)$. While this is a reasonable assumption to make
for the theoretical analysis, it may not be such a trivial issue in practice, as we
discussed in Section 3.2. Andrews (1997) observes that a significant statistic
may be attributable to the failure of the estimation routine to locate the global
minimum. In fact, Andrews (1997) proposes a method based on $J_T$ to determine
whether the global minimum has been reached. We do not pursue the
details here, because this approach confounds issues of numerical convergence
and model specification. Nevertheless, Andrews's observation does re-emphasize the
importance of locating the global minimum.


Example: Hansen and Singleton’s (1982) Consumption Based Asset 
Pricing Model 

Table 5.1 reports the overidentifying restrictions test statistics based on both
the two step and iterated GMM estimators. For brevity, we only report
results for the case in which the first step weighting matrix is the inverse of the
instrument cross product matrix. Two choices of covariance matrix estimator
are used: $\hat S_{SU}$ and $\hat S_{SU,\mu}$. However, in this case, the conclusions are affected
by neither iteration nor the choice of covariance matrix estimator. The model
is rejected with equally weighted returns (EWR) but cannot be rejected with
value weighted returns (VWR). For the record, we also note that the same
conclusions are drawn using the continuous updating estimator described in Section
3.7. In each case, the $J$-statistic based on the continuous updating GMM estimator is only
marginally smaller than its counterpart based on the iterated estimator. □


Table 5.1
Overidentifying restriction test statistics

Asset   $\hat S_T$            Statistic   Two-step   Iterated
EWR     $\hat S_{SU}$         $J_T$       11.645     11.810
                              p-value     0.009      0.008
        $\hat S_{SU,\mu}$     $J_T$       11.945     12.116
                              p-value     0.008      0.007
VWR     $\hat S_{SU}$         $J_T$       1.747      1.748
                              p-value     0.626      0.626
        $\hat S_{SU,\mu}$     $J_T$       1.754      1.755
                              p-value     0.625      0.625

Notes: $\hat S_{SU}$, $\hat S_{SU,\mu}$ are given in (3.40) and (4.24) respectively, $J_T$ denotes the overidentifying
restrictions test in (5.2) and p-value denotes the observed significance level of $J_T$.
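
The p-values in Table 5.1 can be recovered from the reported statistics with a few lines of code. The sketch below assumes $q = 5$ moment conditions (an intercept plus two lags each of consumption growth and the asset return as instruments) and $p = 2$ parameters, so that the reference distribution is $\chi^2_3$; these counts are an assumption made for illustration, and the helper name `j_test_pvalue` is hypothetical.

```python
from scipy.stats import chi2


def j_test_pvalue(j_stat, q, p):
    """Upper-tail chi-square probability with q - p degrees of freedom."""
    return chi2.sf(j_stat, df=q - p)


# Two-step statistics from Table 5.1 (covariance estimator S_SU).
# q = 5 and p = 2 are assumed here, giving 3 degrees of freedom.
p_ewr = j_test_pvalue(11.645, q=5, p=2)
p_vwr = j_test_pvalue(1.747, q=5, p=2)

print(round(p_ewr, 3), round(p_vwr, 3))
```

Under the assumed degrees of freedom the computed values agree with the tabulated p-values to the reported precision.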


5.2 Testing Hypotheses about Subsets of
$E[f(v_t,\theta_0)]$


The vector of population moment conditions can often be partitioned into a set
of sub-vectors each of which refers to a different aspect of the model. In some
cases, a priori information may indicate that if there is misspecification then it
is confined to a particular part of the population moment condition so that the
true model, $\mathcal{M}_{B,S}$, would have the property,

$$\mathcal{M}_{B,S} \Rightarrow E[f_1(v_t,\theta_0)] = 0 \text{ for some unique } \theta_0 \in \Theta \text{ but } E[f_2(v_t,\theta_0)] \neq 0 \qquad (5.18)$$

for some partition $f(v_t,\theta) = [f_1(v_t,\theta)', f_2(v_t,\theta)']'$. Since $\mathcal{M}_{B,S} \subset \mathcal{M}_B$, the
overidentifying restrictions test is consistent against this type of misspecification.
However, it is possible to construct a more powerful test of the model
specification by taking advantage of the a priori information on the likely source
of the misspecification. In this section, we present this test and analyze its
properties under local forms of misspecification.

To begin, it is necessary to define the partition of $f(.)$ more formally and
also introduce a partition of $\theta$. Let $\theta_0 = (\theta_{0,1}', \theta_{0,2}')'$ where $\theta_{0,i}$ is $(p_i \times 1)$,
and $f(v_t,\theta_0)' = [f_1(v_t,\theta_{0,1})', f_2(v_t,\theta_0)']$ where $f_i(.)$ is $(q_i \times 1)$. Without loss of
generality we focus on the case in which it is desired to test the null hypothesis

$$H_0^S:\ E[f_1(v_t,\theta_{0,1})] = 0 \text{ and } E[f_2(v_t,\theta_0)] = 0 \qquad (5.19)$$

against the alternative that

$$H_1^S:\ E[f_1(v_t,\theta_{0,1})] = 0 \text{ and } E[f_2(v_t,\theta_0)] \neq 0 \qquad (5.20)$$


Two features of this specification should be noted. First, the veracity of
$E[f_1(v_t,\theta_{0,1})] = 0$ is maintained under both null and alternative; so the
potential misspecification is confined to $E[f_2(v_t,\theta_0)]$. Secondly, this framework allows
for the possibility that the maintained moment conditions, $E[f_1(v_t,\theta_{0,1})] = 0$,
only depend on part of the parameter vector.

Both Newey (1985a) and Eichenbaum, Hansen, and Singleton (1988) have
proposed methods for discriminating between these two hypotheses. Although
these authors take very different approaches, Ahn (1995) shows their resulting
statistics are asymptotically equivalent under both $H_0^S$ and local versions of
$H_1^S$. From a practical perspective, Eichenbaum, Hansen, and Singleton's (1988)
statistic is far easier to calculate, and so we concentrate exclusively on this test.
Readers interested in the approach taken by Newey (1985a) are referred to his
original paper or the discussion in the review article by Hall (1999).

Eichenbaum, Hansen, and Singleton's (1988) statistic is so convenient
because it is simply the difference between two overidentifying restrictions tests.
The first is the overidentifying restrictions test from GMM estimation based on
the full set of population moment conditions, $J_T$ in (5.2). The second is the
overidentifying restrictions test associated with GMM estimation of $\theta_{0,1}$ based
on the moment conditions maintained under both $H_0^S$ and $H_1^S$, that is

$$J_{1,T} = T g_{1,T}(\tilde\theta_{1,T})'\hat S_{11}^{-1} g_{1,T}(\tilde\theta_{1,T}) \qquad (5.21)$$

where $\tilde\theta_{1,T}$ is the two step (or iterated) GMM estimator of $\theta_{0,1}$ based on
$E[f_1(v_t,\theta_{0,1})] = 0$, $g_{1,T}(\theta_1) = T^{-1}\sum_{t=1}^T f_1(v_t,\theta_1)$, and $\hat S_{11}$ is a consistent estimator of
$S_{11} = \lim_{T\to\infty} Var[T^{-1/2}\sum_{t=1}^T f_1(v_t,\theta_{0,1})]$. Eichenbaum, Hansen, and
Singleton's (1988) statistic is then given by,

$$C_T = J_T - J_{1,T} \qquad (5.22)$$


The intuition behind the test’s construction is most readily appreciated after an 
exploration of its properties, and so we now proceed to the statistical analysis, 
but return to the intuition at the end of this section. 

We begin with its limiting distribution under $H_0^S$. It is clear from the
structure of the statistic that most of the regularity conditions are going to be the
same as for the corresponding result for the overidentifying restrictions test in
Theorem 5.1. However, $C_T$ also depends on a second GMM estimation
using $E[f_1(v_t,\theta_{0,1})] = 0$ alone, and so it is necessary to introduce the following
identification condition.$^{15}$


Assumption 5.2 Identification Condition for $\theta_{0,1}$
$E[\partial f_1(v_t,\theta_{0,1})/\partial\theta_1']$ has rank $p_1$.


Notice that this assumption implies $q_1 \geq p_1$. Clearly, if $q_1 = p_1$ then $J_{1,T} = 0$
and $C_T$ reduces to $J_T$; therefore it is assumed below that $q_1 > p_1$. It must also
be the case that $q_2 > p_2$ otherwise there would be a value of $\theta_{0,2}$ which sets $E[f_2(v_t,\theta_0)]$
equal to zero for any given value of $\theta_{0,1}$.$^{16}$


Theorem 5.5 The Asymptotic Distribution of Eichenbaum, Hansen,
and Singleton's (1988) Statistic under $H_0^S$

If (i) Assumptions 3.1-3.5, 3.8-3.13 and 5.2 hold; (ii) $q_1 > p_1$, $q_2 > p_2$; (iii)
$\hat S_{11}$ is positive semi-definite and converges in probability to $S_{11}$; (iv) $\hat S_T$ is
positive semi-definite and converges in probability to $S$; then $C_T \xrightarrow{d} \chi^2_{q_2-p_2}$.


The proof is somewhat involved and so is relegated to the technical details 
sub-section at the end of this section. 

There is an interesting pattern to the degrees of freedom of $J_T$, $J_{1,T}$ and
$C_T$. Theorem 5.1 implies that $J_T$ and $J_{1,T}$ have $q - p$ and $q_1 - p_1$ degrees
of freedom respectively. Theorem 5.5 implies that $C_T$ has $q_2 - p_2$ degrees of
freedom. Therefore, the subtraction of $J_{1,T}$ from $J_T$ has created a statistic,
$C_T$, with $(q_1 - p_1)$ fewer degrees of freedom. Notice that the resulting degrees
of freedom equal the degree to which $\theta_{0,2}$ is overidentified by $E[f_2(v_t,\theta_0)] = 0$
given $\theta_{0,1}$.
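
The construction in (5.22) and the degrees of freedom implied by Theorem 5.5 can be sketched in code as follows. The function name `ehs_test` and the numerical inputs are hypothetical and chosen purely for illustration; in practice the two $J$ statistics would come from separate GMM estimations on the full and the maintained moment conditions.

```python
from scipy.stats import chi2


def ehs_test(j_full, j_sub, q, p, q1, p1):
    """Difference-of-J statistic C_T = J_T - J_{1,T} with its chi-square p-value.

    j_full : J statistic from the full set of q moments and p parameters
    j_sub  : J statistic from the q1 maintained moments and p1 parameters
    """
    c_stat = j_full - j_sub
    df = (q - p) - (q1 - p1)  # equals q2 - p2
    return c_stat, df, chi2.sf(c_stat, df)


# Illustrative (made-up) inputs: q = 7, p = 3, q1 = 3, p1 = 1, so q2 - p2 = 2.
c_stat, df, pval = ehs_test(j_full=10.0, j_sub=4.0, q=7, p=3, q1=3, p1=1)
print(c_stat, df, round(pval, 4))
```

With these inputs the statistic is 6.0 on 2 degrees of freedom, so the p-value is $e^{-3} \approx 0.0498$.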

At the beginning of this section, it is stated that it is possible to use in- 
formation on the nature of the misspecification to construct a more powerful 
test than Jr. It is now time to show that Cr fulfils this promise. To do this, 
it is necessary to move into a setting in which the true model satisfies $H_1^S$.
In Section 5.1, we introduced two frameworks for analyzing the behaviour of 
test statistics in misspecified models: a non-local and a local analysis. In that 


15 See Section 3.1 for a discussion of identification. 
16 This follows from the assumption of stationarity by the same logic used to deduce that 
Assumption 4.1 implies q > p; see the preamble to Chapter 4. 
context, it is shown that either framework can be used to delineate the class of
alternatives against which $J_T$ has power. However, the non-local framework is
not well suited to the question at hand here because the end result is that $C_T$,
like $J_T$, rejects $H_0^S$ with probability one in the limit.$^{17}$ While useful to know,
this does not help us to characterize which is more powerful. In contrast, the
local framework is carefully constructed so that the test statistics converge in
distribution. Since the end product is a distribution, it is possible to compare
the power properties of two statistics within this framework, and so this is the
approach we take.

Section 5.1.3 presents a local power analysis of the overidentifying
restrictions test. In that earlier context, it is instructive to set up local alternatives
using the identifying and overidentifying restrictions. However, that approach
is less convenient here. Instead, we consider the following sequence of local
alternatives to $H_0^S$,

$$H^S_{1,T}:\ E\begin{bmatrix} f_1(v_t,\theta_{0,1}) \\ f_2(v_t,\theta_0) \end{bmatrix} = T^{-1/2}\begin{bmatrix} 0 \\ \mu_2 \end{bmatrix} = T^{-1/2}\mu_S \qquad (5.23)$$

For brevity, we confine ourselves to a heuristic comparison of the distributions
of $J_T$ and $C_T$ under $H^S_{1,T}$. First, recall from Section 5.1.3 that for our purposes
there is only one important difference between the data generation processes
under the null and local alternatives and that is in the limiting distribution of
the sample moment. Under $H^S_{1,T}$, we have

$$S^{-1/2}T^{1/2}g_T(\theta_0) \xrightarrow{d} N(S^{-1/2}\mu_S, I_q) \qquad (5.24)$$

and it is shown in the technical details sub-section at the end of this section
that this behaviour translates into

$$J_T \xrightarrow{d} \chi^2_{q-p}(\nu_J) \qquad (5.25)$$
$$C_T \xrightarrow{d} \chi^2_{q_2-p_2}(\nu_J) \qquad (5.26)$$

where $\nu_J = \mu_S'S^{-1/2\prime}[I_q - P(\theta_0)]S^{-1/2}\mu_S$. Therefore, the only difference
between the limiting distributions is in the degrees of freedom. If $\nu_J > 0$ then $C_T$
is the more powerful test because it has fewer degrees of freedom.$^{18}$ $^{19}$
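
The power comparison implied by (5.25)-(5.26) can be illustrated numerically. The sketch below evaluates the asymptotic local power of a 5% test for two non-central $\chi^2$ distributions that share a non-centrality parameter but differ in degrees of freedom; the values 3 and 2 for the degrees of freedom and 5.0 for the non-centrality parameter are illustrative assumptions, and `local_power` is a hypothetical helper name.

```python
from scipy.stats import chi2, ncx2


def local_power(df, noncentrality, level=0.05):
    """Asymptotic rejection probability of a chi-square test under a local alternative."""
    critical_value = chi2.ppf(1.0 - level, df)          # central chi-square critical value
    return ncx2.sf(critical_value, df, noncentrality)   # non-central upper tail probability


nu = 5.0                                        # common non-centrality parameter (assumed)
power_J = local_power(df=3, noncentrality=nu)   # J_T with q - p = 3
power_C = local_power(df=2, noncentrality=nu)   # C_T with q2 - p2 = 2

print(round(power_J, 3), round(power_C, 3))
```

For a fixed non-centrality parameter the rejection probability falls as the degrees of freedom rise, so `power_C` exceeds `power_J` in this sketch.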

The foregoing discussion gives a useful perspective on the construction of the 
test. We can think of the overidentifying restrictions based on $E[f(v_t,\theta_0)] = 0$
as being built up of two components. The first component is the set of


17 $C_T$ is a consistent test of $H_0^S$ against $H_1^S$. The essence of the proof is quite simple.
Since $\mathcal{M}_{B,S} \subset \mathcal{M}_B$, it follows from Theorem 5.2 that $T^{-1}J_T \xrightarrow{p} c_S > 0$ under $H_1^S$. Also,
$E[f_1(v_t,\theta_{0,1})] = 0$ under $H_1^S$, and so it follows from Theorem 5.1 that $J_{1,T} = O_p(1)$. Taken together,
these two properties imply $T^{-1}C_T \xrightarrow{p} c_S$, and hence the test is consistent.

18 From the analysis in Section 5.1.3, it follows that $\nu_J > 0$ if the data are generated by a
sequence of processes which satisfies both $H^S_{1,T}$ and $H^O_{A,T}$.

19 See Johnson and Kotz (1970) [Chapter 28] for a discussion of the properties of the
non-central $\chi^2$ distribution.
$q_1 - p_1$ overidentifying restrictions for $\theta_{0,1}$ based on $E[f_1(v_t,\theta_{0,1})] = 0$; the
second component is the set of $q_2 - p_2$ overidentifying restrictions for $\theta_{0,2}$ based
on $E[f_2(v_t,\theta_0)] = 0$ given $\theta_{0,1}$. Each component contributes to the degrees
of freedom of the test, but it is only the second which contributes to the
non-centrality parameter under $H^S_{1,T}$. This structure is exploited in the construction
of $C_T$ because the statistic is effectively calculated by subtracting from $J_T$ the
part which is insensitive to the misspecification under $H^S_{1,T}$.

Although the statistic has been motivated as a test of a subset of the
population moment condition, the null hypothesis, $H_0^S$, involves both $E[f_1(v_t,\theta_{0,1})] = 0$
and $E[f_2(v_t,\theta_0)] = 0$. Therefore, the test is potentially sensitive to
misspecification of any part of the population moment condition, and so the veracity
of the a priori information is crucially important in the interpretation of a
significant statistic.


Example: Hansen and Singleton’s (1982) Consumption Based Asset 
Pricing Model 

In Section 5.1 it is shown that the use of the overidentifying restrictions test
leads to rejection of the model with equally weighted returns (EWR) but not
with value weighted returns (VWR). We now investigate the specification of the
model with VWR further using Eichenbaum, Hansen, and Singleton's (1988)
statistic. It can be recalled that our estimation employs an instrument vector
which contains an intercept and lagged values of both consumption growth and
the asset return. It may be possible that the moment conditions associated with
either of the latter two variables are incompatible with the data but this was not
detected with the overidentifying restrictions test for the types of reason described
above. This possibility leads us to consider two versions of Eichenbaum, Hansen,
and Singleton's (1988) statistic. To introduce the associated null and
alternative, we set $z_{1,t} = (c_t/c_{t-1},\ c_{t-1}/c_{t-2})'$ and $z_{2,t} = (r_t/p_{t-1},\ r_{t-1}/p_{t-2})'$. The first
version tests whether the moments associated with consumption growth are
compatible with the data and so the null and alternative are given by
(5.19)-(5.20) with

$$f_1(v_t,\theta_0) = \begin{bmatrix} 1 \\ z_{2,t} \end{bmatrix} u_t(\theta_0), \qquad f_2(v_t,\theta_0) = z_{1,t}u_t(\theta_0)$$

The second version tests whether the moment conditions associated with the
asset return are compatible with the data, that is $H_0^S$ and $H_1^S$ in (5.19)-(5.20) with

$$f_1(v_t,\theta_0) = \begin{bmatrix} 1 \\ z_{1,t} \end{bmatrix} u_t(\theta_0), \qquad f_2(v_t,\theta_0) = z_{2,t}u_t(\theta_0)$$

The results are given in Table 5.2. In each case the long run variance is estimated
using $\hat S_{SU}$ and the statistics are based on the iterated estimator. Clearly, neither
test offers evidence against the specification in this case.
Table 5.2
Eichenbaum, Hansen, and Singleton's (1988) statistics for the
consumption based asset pricing model

Statistic    d.f.   $z_1$    $z_2$
$J_{1,T}$    1      1.241    0.384
p-value             0.265    0.536
$C_T$        2      0.503    1.363
p-value             0.778    0.506

Notes: $z_i$ denotes the choice of instrument in $f_2(v_t,\theta)$, $J_{1,T}$ denotes the overidentifying
restrictions test in (5.21), $C_T$ denotes the statistic in (5.22), d.f. denotes
degrees of freedom and p-value denotes the observed significance level.


5.2.1 Technical Details 


I: Proof of Theorem 5.5:
Before we begin, it is useful to introduce the following partition of $G(\theta) =
E[\partial f(v_t,\theta)/\partial\theta']$ into four blocks conforming to the partitions of $f(.)$ and $\theta$,

$$G(\theta) = \begin{bmatrix} G_1(\theta) \\ G_2(\theta) \end{bmatrix} = \begin{bmatrix} G_{1,1}(\theta) & G_{1,2}(\theta) \\ G_{2,1}(\theta) & G_{2,2}(\theta) \end{bmatrix}$$

where $G_{i,j} = E[\partial f_i(v_t,\theta)/\partial\theta_j']$.
In view of conditions (iii)-(iv) of the theorem, it suffices to consider

$$C_T = T\{g_T(\hat\theta_T)'S^{-1}g_T(\hat\theta_T) - g_{1,T}(\tilde\theta_{1,T})'S_{11}^{-1}g_{1,T}(\tilde\theta_{1,T})\} \qquad (5.27)$$

Since the proof is quite long, it is useful to present an overview of the proof
strategy. There are three main steps.
Step 1: It is shown that

$$S^{-1/2}T^{1/2}g_T(\hat\theta_T) = A_1 S^{-1/2}T^{1/2}g_T(\theta_0) + o_p(1) \qquad (5.28)$$
$$S_{11}^{-1/2}T^{1/2}g_{1,T}(\tilde\theta_{1,T}) = A_2 S^{-1/2}T^{1/2}g_T(\theta_0) + o_p(1) \qquad (5.29)$$

for certain matrices of constants $A_1$ and $A_2$, and hence that

$$C_T = T g_T(\theta_0)'S^{-1/2\prime}[A_1'A_1 - A_2'A_2]S^{-1/2}g_T(\theta_0) + o_p(1) \qquad (5.30)$$

Step 2: It is shown that $A_1'A_1 - A_2'A_2$ is idempotent with rank $q_2 - p_2$.
Step 3: Steps 1 and 2 can be combined with the Central Limit Theorem to
derive the stated result along similar lines to the proof of Theorem 5.1.

Since Step 3 is straightforward, we concentrate purely on Steps 1 and 2
below.
Proof of Step 1: The definition of $A_1$ in (5.28) is straightforward because (3.36)
implies

$$S^{-1/2}T^{1/2}g_T(\hat\theta_T) = [I_q - P(\theta_0)]S^{-1/2}T^{1/2}g_T(\theta_0) + o_p(1) \qquad (5.31)$$
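and so $A_1 = [I_q - P(\theta_0)]$. The definition of $A_2$ in (5.29) requires a little more
work. Since $T^{1/2}g_{1,T}(\tilde\theta_{1,T})$ is also an estimated sample moment, this time from
the estimation of $\theta_{0,1}$ based on $E[f_1(v_t,\theta_{0,1})] = 0$, we can appeal once again
to (3.36) and deduce that

$$S_{11}^{-1/2}T^{1/2}g_{1,T}(\tilde\theta_{1,T}) = [I_{q_1} - P_1(\theta_{0,1})]S_{11}^{-1/2}T^{1/2}g_{1,T}(\theta_{0,1}) + o_p(1) \qquad (5.32)$$

where $P_1(\theta_{0,1}) = F_{1,1}(\theta_{0,1})[F_{1,1}(\theta_{0,1})'F_{1,1}(\theta_{0,1})]^{-1}F_{1,1}(\theta_{0,1})'$ and $F_{1,1}(\theta_{0,1}) =
S_{11}^{-1/2}G_{1,1}(\theta_{0,1})$. Now, since

$$S_{11}^{-1/2}T^{1/2}g_{1,T}(\theta_{0,1}) = S_{11}^{-1/2}[I_{q_1} \;\vdots\; 0_{q_1 \times q_2}]S^{1/2}S^{-1/2}T^{1/2}g_T(\theta_0) \qquad (5.33)$$

where $0_{q_1 \times q_2}$ is a $(q_1 \times q_2)$ null matrix, it follows that (5.32) can be rewritten as

$$S_{11}^{-1/2}T^{1/2}g_{1,T}(\tilde\theta_{1,T}) = [I_{q_1} - P_1(\theta_{0,1})]\Xi S^{-1/2}T^{1/2}g_T(\theta_0) + o_p(1) \qquad (5.34)$$

where $\Xi = S_{11}^{-1/2}[I_{q_1} \;\vdots\; 0_{q_1 \times q_2}]S^{1/2}$. A comparison of (5.29) and (5.34) indicates
that $A_2 = [I_{q_1} - P_1(\theta_{0,1})]\Xi$.
Proof of Step 2: First notice that $A_1$ is symmetric and idempotent, so that $A_1'A_1 = A_1$, and

$$A_2'A_2 = \Xi'[I_{q_1} - P_1(\theta_{0,1})]\Xi = B, \text{ say.} \qquad (5.35)$$

So to complete this step of the proof, it is necessary to show that (i) $A_1 - B$ is
idempotent, and (ii) $rank(A_1 - B) = q_2 - p_2$.
Consider (i) first. Using the idempotency of $A_1$, it follows that

$$(A_1 - B)(A_1 - B) = A_1 - BA_1 - A_1B + BB \qquad (5.36)$$

We now show that $BA_1 = A_1B = BB = B$, and so that the right hand side of
(5.36) reduces to $A_1 - B$ which is the desired result. First, notice that from the
definition of $A_1$ we have that

$$BA_1 = B[I_q - P(\theta_0)] = B - BP(\theta_0)$$

So $BA_1 = B$ if $BP(\theta_0) = 0$. This latter result is established by observing that,

$$BP(\theta_0) = BF(\theta_0)[F(\theta_0)'F(\theta_0)]^{-1}F(\theta_0)'$$

and, because $f_1(.)$ does not depend on $\theta_2$ and so $G_{1,2} = 0$,

$$BF(\theta_0) = \Xi'[I_{q_1} - P_1(\theta_{0,1})]\Xi F(\theta_0) = \Xi'[I_{q_1} - P_1(\theta_{0,1})][F_{1,1}(\theta_{0,1}) \;\vdots\; 0_{q_1 \times p_2}] = 0$$

A similar argument can be used for $A_1B = B$, and so we now consider $BB$. By
definition, it follows that

$$BB = \Xi'[I_{q_1} - P_1(\theta_{0,1})]\Xi\Xi'[I_{q_1} - P_1(\theta_{0,1})]\Xi \qquad (5.37)$$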
Since $I_{q_1} - P_1(\theta_{0,1})$ is idempotent, and

$$\Xi\Xi' = S_{11}^{-1/2}S_{11}S_{11}^{-1/2\prime} = I_{q_1} \qquad (5.38)$$

it follows from (5.37) that $BB = B$.

Now consider (ii). Since $A_1 - B$ is idempotent, it follows that $rank(A_1 - B) =
trace(A_1 - B)$.$^{20}$ Furthermore, it can be shown that $trace(A_1 - B) =
trace(A_1) - trace(B)$.$^{21}$ These two traces can be deduced as follows$^{22}$

$$trace(A_1) = trace(I_q) - trace\{P(\theta_0)\}$$
$$= q - trace\{F(\theta_0)'F(\theta_0)[F(\theta_0)'F(\theta_0)]^{-1}\}$$
$$= q - trace(I_p) = q - p$$

and$^{23}$

$$trace(B) = trace\{\Xi'[I_{q_1} - P_1(\theta_{0,1})]\Xi\}$$
$$= trace\{\Xi\Xi'[I_{q_1} - P_1(\theta_{0,1})]\}$$
$$= trace[I_{q_1} - P_1(\theta_{0,1})] = q_1 - p_1$$

Taken together, these two results imply

$$rank(A_1 - B) = trace(A_1) - trace(B) = q - p - (q_1 - p_1) = q_2 - p_2$$

which completes Step 2 of the proof. The theorem then follows by combining
Steps 1-3 in the manner described above. □
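
Step 2 of the proof can be verified numerically for a particular configuration. The sketch below is only an illustration under stated assumptions: $S$ and $G$ are randomly generated, the block $G_{1,2}$ is set to zero because $f_1(.)$ does not depend on $\theta_2$, symmetric matrix square roots are used for $S^{1/2}$ and $S_{11}^{1/2}$, and the dimensions and helper names (`sym_sqrt`, `projection`) are arbitrary choices.

```python
import numpy as np


def sym_sqrt(mat):
    """Symmetric square root of a symmetric positive definite matrix."""
    eigval, eigvec = np.linalg.eigh(mat)
    return (eigvec * np.sqrt(eigval)) @ eigvec.T


def projection(mat):
    """Orthogonal projection matrix onto the column space of mat."""
    return mat @ np.linalg.solve(mat.T @ mat, mat.T)


rng = np.random.default_rng(0)
q1, q2, p1, p2 = 3, 4, 1, 2  # illustrative dimensions with q1 > p1 and q2 > p2
q, p = q1 + q2, p1 + p2

Z = rng.standard_normal((q, q))
S = Z @ Z.T + q * np.eye(q)      # a positive definite long run variance
G = rng.standard_normal((q, p))
G[:q1, p1:] = 0.0                # G_{1,2} = 0: f_1 does not depend on theta_2

S_half = sym_sqrt(S)
S_inv_half = np.linalg.inv(S_half)
S11_inv_half = np.linalg.inv(sym_sqrt(S[:q1, :q1]))

F = S_inv_half @ G                       # F(theta_0)
F11 = S11_inv_half @ G[:q1, :p1]         # F_{1,1}(theta_{0,1})
selector = np.hstack([np.eye(q1), np.zeros((q1, q2))])
Xi = S11_inv_half @ selector @ S_half    # the matrix in (5.34)

A1 = np.eye(q) - projection(F)
B = Xi.T @ (np.eye(q1) - projection(F11)) @ Xi
M = A1 - B

print(np.allclose(M @ M, M), np.linalg.matrix_rank(M, tol=1e-8), round(np.trace(M), 6))
```

In this configuration $A_1 - B$ is idempotent and its rank and trace both equal $q_2 - p_2 = 2$, in line with the argument above.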
II: Derivation of Noncentrality Parameters for $J_T$ and $C_T$

Equation (5.24) can be combined with (5.14) to show that

$$J_T \xrightarrow{d} \|[I_q - P(\theta_0)](n_q + S^{-1/2}\mu_S)\|^2 \qquad (5.39)$$

where once again $n_q$ denotes a random vector with a $N(0, I_q)$ distribution.
Equation (5.39) implies

$$J_T \xrightarrow{d} \chi^2_{q-p}(\nu_J) \qquad (5.40)$$

where $\nu_J = \mu_S'S^{-1/2\prime}[I_q - P(\theta_0)]S^{-1/2}\mu_S$. Equation (5.24) can be combined
with (5.30) and Step 2 of the proof of Theorem 5.5 to show that

$$C_T \xrightarrow{d} \|[A_1 - B](n_q + S^{-1/2}\mu_S)\|^2$$

20 See Dhrymes (1984) [Proposition 55, p.66]. 

21 See Dhrymes (1984) [Proposition 16, p.24]. 

22 The arguments below use the property that $trace(D_1D_2) = trace(D_2D_1)$ for any
conformable matrices $D_1$, $D_2$; see Dhrymes (1984) [Proposition 16, p.24].

23 For the third step, note that (5.38) implies $\Xi\Xi' = I_{q_1}$.




and hence that

$$C_T \xrightarrow{d} \chi^2_{q_2-p_2}(\nu_C) \qquad (5.41)$$

where $\nu_C = \mu_S'S^{-1/2\prime}[A_1 - B]S^{-1/2}\mu_S$. At first glance, $\nu_J$ and $\nu_C$ appear
different, but closer inspection reveals that they are identical. This follows
because: (i) $A_1 = [I_q - P(\theta_0)]$; and (ii) equations (5.35) and (5.23) can be
combined to show that $BS^{-1/2}\mu_S = 0$. □


5.3 Testing Hypotheses About the Parameter 
Vector 


There are many cases in which a particular economic theory implies a set of 
restrictions on the parameter vector of the econometric model. This means it is 
possible to assess the veracity of the theory by testing whether the restrictions 
in question are satisfied by the data. This section describes various methods for 
performing this type of inference. 

The structure of this testing problem is different from those described in the
previous two sections. We now move into a world where the data are assumed
to be generated by a model from the set $\mathcal{M}_A$ defined by$^{24}$

$\mathcal{M}_A \Rightarrow E[f(v_t,\theta_0)] = 0$ for some unique $\theta_0 \in \Theta$

The question of interest is whether the data are generated by the subset of $\mathcal{M}_A$
which satisfy

$$\mathcal{M}_{A,r} \Rightarrow E[f(v_t,\theta_0)] = 0 \text{ for some unique } \theta_0 \in \Theta_r = \{\theta : r(\theta) = 0\} \qquad (5.42)$$

where $r(\theta_0)$ is a vector of nonlinear functions of $\theta_0$. Notice that by definition
$\Theta_r \subset \Theta$. Therefore, the issue is whether $\theta_0$ lies in $\Theta_r$ or its complement in $\Theta$,
$\Theta_r^c$. This type of problem is often referred to as a nested hypothesis test because
$\Theta_r$ can be "nested" in $\Theta$ in the sense that $\Theta_r$ is a subset of $\Theta$.

The vector r(.) must satisfy certain conditions if the restrictions are to be 
meaningful. 


Assumption 5.3 Regularity Conditions for $r(.)$

Let $r : \mathbb{R}^p \to \mathbb{R}^s$ be a $(s \times 1)$ vector of real valued functions which satisfies:
(i) $r(.)$ is a vector of continuously differentiable functions; (ii) $rank\{R(\theta_0)\} = s$
where $R(\theta) = \partial r(\theta)/\partial\theta'$.


This assumption ensures that $r(\theta_0) = 0$ forms a coherent set of equations; that is,
given $p - s$ elements of $\theta_0$, it is possible to solve uniquely for the remaining $s$
values using $r(\theta_0) = 0$.$^{25}$ Notice that this property automatically excludes
redundant restrictions, and also that the rank condition necessarily implies $s \leq p$.


24 Previously we used $\theta_+$ to characterize $\mathcal{M}_A$ but we use $\theta_0$ here for consistency with the
specification of the hypotheses below.

25 These conditions derive from the Implicit Function Theorem; for example, see Apostol 
(1974) [p.374]. 
Newey and West (1987b) develop the theory for testing

$$H_0^R:\ r(\theta_0) = 0 \quad \text{versus} \quad H_1^R:\ r(\theta_0) \neq 0$$

based on GMM estimators. They propose three main statistics which can be
viewed as extensions to the GMM framework of the Wald, Lagrange Multiplier
(LM) and Likelihood Ratio (LR) tests from Maximum Likelihood theory.$^{26}$ To
facilitate the presentation, it is useful to define unrestricted and restricted
estimators of $\theta_0$. The unrestricted estimator is just $\hat\theta_T$ defined earlier. The restricted
estimator is the value of $\theta$ which minimizes $Q_T(\theta)$ subject to $r(\theta) = 0$; this is
denoted $\tilde\theta_T$. The asymptotic properties of the restricted estimator are derived
in the technical details sub-section at the end of this section. It is assumed that
both these minimizations use the weighting matrix $\hat S_T^{-1}$. We now introduce the
three statistics in turn.

The Wald test examines whether the unrestricted estimator, $\hat\theta_T$, satisfies the
restrictions with due allowance for sampling error. The statistic is

$$W_T = T\, r(\hat\theta_T)'\left[ R(\hat\theta_T)\,[G_T(\hat\theta_T)'\hat S_T^{-1}G_T(\hat\theta_T)]^{-1}R(\hat\theta_T)' \right]^{-1} r(\hat\theta_T) \qquad (5.43)$$

The LM test examines whether the restricted estimator, $\tilde\theta_T$, satisfies the first
order conditions from the unrestricted estimation. This statistic is:

$$LM_T = T\, g_T(\tilde\theta_T)'\hat S_T^{-1}G_T(\tilde\theta_T)[G_T(\tilde\theta_T)'\hat S_T^{-1}G_T(\tilde\theta_T)]^{-1}G_T(\tilde\theta_T)'\hat S_T^{-1}g_T(\tilde\theta_T) \qquad (5.44)$$

Finally, the D or LR-type test examines the impact on the GMM minimand of
the imposition of the restrictions. This statistic is

$$D_T = T[Q_T(\tilde\theta_T) - Q_T(\hat\theta_T)] \qquad (5.45)$$

In the context of Maximum Likelihood theory, it is well known that these
three statistics are asymptotically equivalent under the null hypothesis. Newey
and West (1987b) [Theorem 2] show that this equivalence extends to the GMM
setting.
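
The three statistics in (5.43)-(5.45) can be computed side by side in a simple special case. The sketch below uses a simulated linear instrumental variables model with the linear restriction $\theta_1 = \theta_2$; the data generating process, sample size and variable names are all illustrative assumptions. In this special case (linear moments, linear restrictions and a common weighting matrix) the GMM minimand is exactly quadratic, so the three statistics coincide numerically; in nonlinear settings only the asymptotic equivalence described above applies.

```python
import numpy as np

rng = np.random.default_rng(1)
T, q = 400, 4

# Simulated linear IV model y = X theta + u with instruments Z (illustrative only).
Z = rng.standard_normal((T, q))
V = rng.standard_normal((T, 2))
X = Z @ rng.standard_normal((q, 2)) + V
u = 0.5 * V[:, 0] + rng.standard_normal(T)
y = X @ np.array([1.0, 0.7]) + u

# Sample moment g_T(theta) = b + G theta, which is linear in theta.
b = Z.T @ y / T
G = -Z.T @ X / T


def g(theta):
    return b + G @ theta


# First step (2SLS-type) estimate, used only to build the weighting matrix.
W0 = np.linalg.inv(Z.T @ Z / T)
theta_first = -np.linalg.solve(G.T @ W0 @ G, G.T @ W0 @ b)
Zu = Z * (y - X @ theta_first)[:, None]
S_inv = np.linalg.inv(Zu.T @ Zu / T)   # inverse of a serially uncorrelated variance estimate


def Q(theta):
    return g(theta) @ S_inv @ g(theta)


# Unrestricted and restricted estimators with the same weighting matrix.
H = G.T @ S_inv @ G
H_inv = np.linalg.inv(H)
theta_hat = -H_inv @ G.T @ S_inv @ b
R = np.array([[1.0, -1.0]])            # restriction r(theta) = theta_1 - theta_2 = 0
V_r = R @ H_inv @ R.T
r_hat = R @ theta_hat
theta_tilde = theta_hat - H_inv @ R.T @ np.linalg.solve(V_r, r_hat)

wald = float(T * r_hat @ np.linalg.solve(V_r, r_hat))
score = G.T @ S_inv @ g(theta_tilde)
lm = float(T * score @ H_inv @ score)
d = float(T * (Q(theta_tilde) - Q(theta_hat)))

print(wald, lm, d)
```

The restricted estimator here uses the closed form available for a quadratic objective with a linear constraint; with a nonlinear $r(.)$ or nonlinear moments a numerical constrained optimizer would be needed instead.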


Theorem 5.6 Asymptotic Equivalence of $W_T$, $LM_T$ and $D_T$ under $H_0^R$
If (i) Assumptions 3.1-3.5, 3.7-3.13, and 5.3 hold; (ii) $\hat S_T^{-1} \xrightarrow{p} S^{-1}$; then under
$H_0^R$: (a) $W_T = N_T + o_p(1)$; (b) $LM_T = N_T + o_p(1)$; (c) $D_T = N_T + o_p(1)$;
where $N_T = n_T'V_n^{-1}n_T$, $n_T = R(G_0'S^{-1}G_0)^{-1}G_0'S^{-1}T^{1/2}g_T(\theta_0)$, and $V_n =
R(G_0'S^{-1}G_0)^{-1}R'$.


The proof is relegated to the technical details sub-section. One immediate
consequence of Theorem 5.6 is that all three statistics share the limiting
distribution of $N_T$ under $H_0^R$. This distribution is easily deduced from the definition
of $N_T$ because under our conditions it follows that $n_T \xrightarrow{d} N(0, V_n)$. Therefore
we obtain the following distributional result.


26 It should be noted that there are a number of asymptotically equivalent versions of these
tests. Our presentation focuses exclusively on the versions proposed by Newey and West
(1987b). See Newey and McFadden (1994) [p.2222] for a discussion of the alternative versions.
Theorem 5.7 The Limiting Distribution of $W_T$, $LM_T$ and $D_T$ under
$H_0^R$

If (i) Assumptions 3.1-3.5, 3.7-3.13, and 5.3 hold; (ii) $\hat S_T^{-1} \xrightarrow{p} S^{-1}$; then under
$H_0^R$: $W_T \xrightarrow{d} \chi^2_s$, $LM_T \xrightarrow{d} \chi^2_s$ and $D_T \xrightarrow{d} \chi^2_s$ as $T \to \infty$.


There is one other consequence of Theorem 5.6 which should be noted. Using
similar arguments to Theorem 3.5, it is possible to show that $n_T$ is
asymptotically independent of $S^{-1/2}T^{1/2}g_T(\hat\theta_T)$ under the conditions of the
theorem. Since the large sample behaviour of $W_T$, $LM_T$ and $D_T$ is governed
by $n_T$, it follows that these three statistics are also asymptotically independent
of $S^{-1/2}T^{1/2}g_T(\hat\theta_T)$. This, in turn, implies that $W_T$, $LM_T$ and $D_T$ are
asymptotically independent of the overidentifying restrictions test statistic, $J_T$, under
the composite null hypothesis that $E[f(v_t,\theta_0)] = 0$ and $r(\theta_0) = 0$.

Newey and West (1987b) show that the asymptotic equivalence of the
statistics extends to local alternatives characterized by

$$H^R_{1,T}:\ r(\theta_0) = T^{-1/2}\mu_R$$

Furthermore, they show that the statistics converge to a $\chi^2_s(\delta_R)$ where

$$\delta_R = \mu_R'\left[ R(\theta_0)(G_0'S^{-1}G_0)^{-1}R(\theta_0)' \right]^{-1}\mu_R > 0$$

So the statistics have power against the alternative for which they are designed.
In view of their equivalence, some other criteria must be used to choose between
the three. One such criterion is computational burden, although this is less of
a concern now than it once was. The D statistic is more burdensome because it
requires two estimations, whereas the Wald and LM statistics only require one. Sometimes
the unrestricted estimation is easier and sometimes not; it all depends on
the model in question and the nature of $r(.)$. However, the Wald test has
two disadvantages which should be mentioned. First, it is not invariant to
a reparameterization of the model or the restrictions. This means that it is
possible to rewrite the model and restrictions in a logically consistent way, but
end up with a different Wald statistic.$^{27}$ Neither of the other two tests has this
problem.$^{28}$ The second disadvantage is that the Wald statistic tends to be less
well approximated by the $\chi^2$ distribution in finite samples than the other two
statistics; for example, see the simulation evidence reported in Gallant (1987).
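
The lack of invariance of the Wald statistic can be seen in a one-parameter illustration. The sketch below is a hypothetical example, not one taken from the text: it considers a scalar parameter with estimate 1.2 and estimated variance 0.01, and tests the restriction $\theta = 1$ written equivalently (for positive $\theta$) as $\theta^k = 1$, using the delta method for the variance of $r(\hat\theta) = \hat\theta^k - 1$.

```python
def wald_power_form(theta_hat, var_hat, k):
    """Wald statistic for the restriction theta**k - 1 = 0 with a scalar parameter.

    var_hat is the estimated variance of theta_hat (already scaled by the sample size).
    """
    r = theta_hat**k - 1.0            # restriction evaluated at the estimate
    R = k * theta_hat**(k - 1)        # derivative of the restriction
    return r**2 / (R * var_hat * R)   # delta-method Wald statistic


# Illustrative (made-up) estimate and variance.
w_k1 = wald_power_form(theta_hat=1.2, var_hat=0.01, k=1)
w_k2 = wald_power_form(theta_hat=1.2, var_hat=0.01, k=2)

print(round(w_k1, 4), round(w_k2, 4))
```

The two algebraically equivalent forms of the same hypothesis give Wald statistics of 4.0 and roughly 3.36, so a test at the 5% level (critical value 3.84) would reject under one formulation and not under the other.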

At this stage, it is useful to bring into the light one assumption that has been
lurking in the shadows. Throughout the analysis in this section, it has been
assumed that Assumption 3.3 holds and so $E[f(v_t,\theta_0)] = 0$. It is important to
realize that a violation of this assumption can also lead to a significant statistic.
In other words, $H_0^R$ may be rejected either because Assumption 3.3 holds but
$r(\theta_0) \neq 0$, or because the model is misspecified. Hall and


27 For example, the restriction $\theta_{0,i} = \theta_{0,j}$ can also be rewritten as $\theta_{0,i}^k = \theta_{0,j}^k$ for any finite
positive integer $k$. This sensitivity of the Wald statistic derives from the sensitivity of the
asymptotic standard errors to reparameterization; see Section 3.7.

28 Davidson and MacKinnon (1993) [p.467-9] provide a useful discussion of this issue and
some examples. Also see Critchley, Marriott, and Salmon (1996).
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Inoue (2003) provide a formal justification for this statement within the frame- 
work of non-local misspecification employed in Chapter 4. Their results indicate 
that Wald, LM and D tests do not converge to limiting x2 distributions in mis- 
specified models even if the restrictions are satisfied. Furthermore, the limiting 
behaviour of the three test statistics depends crucially on the covariance matrix 
estimator employed. For example, Hall and Inoue (2003) show that Wr, LMr 
and Dr diverge to infinity in the case where either a centred or uncentred HAC 
estimator is used. These results emphasize the importance of using the model 
specification tests, Jr or Cr, before undertaking inference about the parameters. 

To conclude our discussion, it is useful to explore briefly a different perspective
on $H_0^R$ involving the identifying restrictions. It can be recalled from
Section 3.3 that the identifying restrictions can be interpreted as the restriction
that the projection of $S^{-1/2}E[f(v_t,\theta_0)]$ onto the column space of $F(\theta_0)$,
$\mathcal{C}[F(\theta_0)]$, is zero.²⁹ We now show that the restrictions can be interpreted as
a statement about the structure of $\mathcal{C}[F(\theta_0)]$. To do this, notice that if $H_0^R$ is
true then the Implicit Function Theorem implies that the population moment
condition can be written as

$$E[f(v_t, g(\psi_0))] = 0 \qquad (5.46)$$

where $\psi_0$ is a $(p-s)\times 1$ vector which satisfies $\theta_0 = g(\psi_0)$. Now, if (5.46) is treated
as a basis for GMM estimation of $\psi_0$ then the associated identifying restrictions
imply the projection of $S^{-1/2}E[f(v_t,\theta_0)]$ onto the column space of $F(\psi_0)$ is zero
where $F(\psi_0) = S^{-1/2}E[\partial f(v_t, g(\psi_0))/\partial\psi']$. However, since

$$F(\psi_0) = F(\theta_0)\{\partial g(\psi_0)/\partial\psi'\}$$

it follows that the column space of $F(\psi_0)$, $\mathcal{C}[F(\psi_0)]$, is of dimension $p-s$ and
$\mathcal{C}[F(\psi_0)] \subset \mathcal{C}[F(\theta_0)]$.
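
The inclusion of the two column spaces can be checked numerically. The sketch below is only an illustration with randomly generated matrices standing in for $F(\theta_0)$ and $\partial g(\psi_0)/\partial\psi'$; the dimensions are arbitrary choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
q, p, s = 5, 3, 1                           # moments, parameters, restrictions (hypothetical)
F_theta = rng.standard_normal((q, p))       # stands in for F(theta_0), full column rank
dg_dpsi = rng.standard_normal((p, p - s))   # stands in for dg(psi_0)/dpsi'
F_psi = F_theta @ dg_dpsi                   # F(psi_0) = F(theta_0) {dg(psi_0)/dpsi'}

rank_theta = np.linalg.matrix_rank(F_theta)
rank_psi = np.linalg.matrix_rank(F_psi)
# Appending the columns of F(psi_0) adds nothing new if they already lie in C[F(theta_0)]
rank_joint = np.linalg.matrix_rank(np.hstack([F_theta, F_psi]))
print(rank_theta, rank_psi, rank_joint)
```

The rank of the stacked matrix equals the rank of the stand-in for $F(\theta_0)$, while the stand-in for $F(\psi_0)$ has rank $p-s$.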

It is interesting to contrast this perspective on $H_0^R$ with what we learned about
testing $H_0: E[f(v_t,\theta_0)] = 0$ in the course of our earlier analysis of the overidentifying
restrictions test. The analyses in Sections 5.1.2 and 5.1.3 indicate that tests of
the validity of the population moment condition revolve around the overidentifying
restrictions which, it can be recalled from Section 3.3, involve the orthogonal
complement of $F(\theta_0)$. Therefore, the fundamental decomposition inherent in GMM
estimation reverberates into hypothesis testing based on the estimator: hypotheses
about the parameters are equivalent to hypotheses about the column space of
$F(\theta_0)$, and hypotheses about the population moment condition are equivalent to
hypotheses about the orthogonal complement of $F(\theta_0)$.


Example: Hansen and Singleton’s (1982) Consumption Based Asset 
Pricing Model 


It can be recalled from Section 5.1 that the overidentifying restrictions test is 
significant when the asset is the index with equally weighted returns (EWR). We 
interpret this rejection as being indicative of misspecification, and so, in view 


29 Recall that in this chapter we focus exclusively on the two step or iterated estimator and
so $S^{-1}$ must be substituted for $W$ in Section 3.3.

of the remarks above, do not consider that version of the model here. Instead
we concentrate purely on the value weighted returns (VWR) case for which the
overidentifying restrictions test is insignificant.

Using L’Hôpital’s rule it can be shown that $\lim_{\gamma\to 0}(c^{\gamma}-1)/\gamma = \ln(c)$.³⁰
Therefore the restriction $\gamma_0 = 0$ reduces the CRRA utility function to the log utility
function. This restriction can be expressed in our general notation by putting
$r(\theta_0) = \gamma_0$. If we define $\theta_0 = [\gamma_0, \delta_0]'$ then $R(\theta_0)$ is given by

$$R(\theta_0) = [1, 0]$$


It is immediately apparent that this choice of r(.) satisfies the regularity condi- 
tions in Assumption 5.3. The restricted estimation is performed using the pro- 
cedure constr in the MATLAB version 6.0 Optimization Toolbox (Mathworks, 
2000). 
Table 5.3 contains the $W_T$, $LM_T$ and $D_T$ statistics for the test of $H_0^R: \gamma_0 = 0$.
All three statistics are calculated using $\hat S_T = \hat S_{SU}$. From Theorem 5.7, all
three test statistics converge to a $\chi^2_1$ distribution under this null. Notice that for this case
the Wald test has a very simple form. Since $r(\hat\theta_T) = \hat\gamma_T$ and $R(\hat\theta_T) = [1,0]$,
(5.43) reduces to

$$W_T = T\hat\gamma_T^2/\hat V_{11}$$

where $\hat V_{11}$ is the (1,1) element of $[G_T(\hat\theta_T)'\hat S_T^{-1}G_T(\hat\theta_T)]^{-1}$. In other words, the
Wald statistic is just the square of the “t-statistic” for $\gamma_0 = 0$.

In this particular example, the choice between the three statistics is of no
consequence because they are identical to three decimal places. As can be seen,
we fail to reject $H_0^R: \gamma_0 = 0$ at conventional levels of significance.


Table 5.3
Test statistics for $H_0^R: \gamma_0 = 0$

Test      Statistic   p-value
$W_T$     0.133       0.715
$LM_T$    0.133       0.715
$D_T$     0.133       0.715

Note: $W_T$, $LM_T$ and $D_T$ are defined in (5.43)-(5.45).
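
The p-value in Table 5.3 can be reproduced from the reported statistic. The sketch below assumes a $\chi^2$ limiting distribution with one degree of freedom, since a single restriction is tested, and uses the identity $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$; the function name is ours.

```python
import math

def chi2_1_pvalue(x):
    """Upper-tail probability of a chi-squared(1) variable at x."""
    return math.erfc(math.sqrt(x / 2.0))

p_value = chi2_1_pvalue(0.133)   # statistic reported in Table 5.3
print(round(p_value, 3))
```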


5.3.1 GMM Estimation Subject to Nonlinear Restrictions
on $\theta_0$ and Other Technical Details


I. The Asymptotic Properties of the Restricted GMM Estimator
The restricted two step GMM estimator is defined by

$$\tilde\theta_T = \arg\min_{\theta\in\Theta_R} Q_T(\theta) \qquad (5.47)$$

where $\Theta_R = \{\theta \text{ s.t. } \theta\in\Theta \text{ and } r(\theta) = 0\}$ and $Q_T(\theta) = g_T(\theta)'\hat S_T^{-1}g_T(\theta)$.
Throughout, it is assumed that $\theta_0$ satisfies the restrictions.

30 See Rudin (1976) [p.109].

Assumption 5.4 Restrictions on $\theta_0$
$r(\theta_0) = 0$.

The analysis is split into two parts: the consistency of $\tilde\theta_T$ and the asymptotic
distribution of $T^{1/2}(\tilde\theta_T-\theta_0)$. As in Chapter 3, the logical sequence is to begin
with consistency.

A comparison of (3.11) and (5.47) indicates that the only difference between 
the restricted and unrestricted estimations stems from the set over which the 
minimization is taken. It is therefore straightforward to modify the proof of 
Theorem 3.1 to establish the following. 


Lemma 5.2 Consistency of $\tilde\theta_T$
If Assumptions 3.1-3.4, 3.7-3.10, 5.3 and 5.4 hold then: $\tilde\theta_T \xrightarrow{p} \theta_0$.


While the characterization in (5.47) can be used to establish consistency, it
does not lend itself to the derivation of the asymptotic distribution of $T^{1/2}(\tilde\theta_T-\theta_0)$.
For this question, it is more fruitful to define $\tilde\theta_T$ using Lagrange’s method for
constrained optimization. Accordingly, we introduce the Lagrangean function

$$\mathcal{L}_T(\theta,\rho) = Q_T(\theta) - 2r(\theta)'\rho \qquad (5.48)$$

where $2\rho$ is the $(s\times 1)$ vector of Lagrange multipliers.³¹ Subject to certain
regularity conditions,³² $\tilde\theta_T$ and the associated estimator of $\rho$, denoted $\tilde\rho_T$, satisfy
the first order conditions, $\partial\mathcal{L}_T(\tilde\theta_T,\tilde\rho_T)/\partial\theta = 0$ and $\partial\mathcal{L}_T(\tilde\theta_T,\tilde\rho_T)/\partial\rho = 0$. In this
case, these conditions yield

$$G_T(\tilde\theta_T)'\hat S_T^{-1}g_T(\tilde\theta_T) - R(\tilde\theta_T)'\tilde\rho_T = 0 \qquad (5.49)$$
$$-r(\tilde\theta_T) = 0 \qquad (5.50)$$

To derive the asymptotic distribution of $T^{1/2}(\tilde\theta_T-\theta_0)$, it is necessary to know
the probability limits of $\tilde\theta_T$ and $\tilde\rho_T$; the former limit is provided by Lemma 5.2
above, and the latter is given in the following lemma.


Lemma 5.3 Probability Limit of $\tilde\rho_T$
If Assumptions 3.1-3.5, 3.7-3.10, 3.12-3.13, 5.3 and 5.4 hold then: $\tilde\rho_T \xrightarrow{p} 0$.


This result can be derived by considering the limiting behaviour of (5.49) as
$T \to \infty$; however, we leave the details to the reader.³³

The asymptotic distribution of $T^{1/2}(\tilde\theta_T-\theta_0)$ is deduced from (5.49)-(5.50).
However, before this can be done, each equation requires a certain amount
of manipulation, and so we start by considering each equation individually.
Equation (5.49) implies that

$$G_T(\tilde\theta_T)'\hat S_T^{-1}T^{1/2}g_T(\tilde\theta_T) - R(\tilde\theta_T)'T^{1/2}\tilde\rho_T = 0 \qquad (5.51)$$


31 The factor of 2 is introduced for ease of presentation below.
32 See Intriligator (1971) [Chapter 3].
33 Or see Newey and McFadden (1994) [p.2218].

Under our conditions, $G_T(\tilde\theta_T) \xrightarrow{p} G_0$, and if we assume $\hat S_T \xrightarrow{p} S$ then (5.51) can
be rewritten as

$$G_0'S^{-1}T^{1/2}g_T(\tilde\theta_T) - R(\theta_0)'T^{1/2}\tilde\rho_T + o_p(1) = 0 \qquad (5.52)$$


The next step involves the use of the Mean Value Theorem to linearize $T^{1/2}g_T(\tilde\theta_T)$
around $T^{1/2}g_T(\theta_0)$. Under our assumptions, this linearized version implies

$$T^{1/2}g_T(\tilde\theta_T) = T^{1/2}g_T(\theta_0) + G_0T^{1/2}(\tilde\theta_T-\theta_0) + o_p(1) \qquad (5.53)$$

Finally, if (5.53) is substituted into (5.52) then we obtain

$$G_0'S^{-1}T^{1/2}g_T(\theta_0) + G_0'S^{-1}G_0T^{1/2}(\tilde\theta_T-\theta_0) - R(\theta_0)'T^{1/2}\tilde\rho_T + o_p(1) = 0 \qquad (5.54)$$

Now consider (5.50). The Mean Value Theorem and Lemma 5.2 can be used to
deduce that

$$T^{1/2}r(\tilde\theta_T) = T^{1/2}r(\theta_0) + R(\theta_0)T^{1/2}(\tilde\theta_T-\theta_0) + o_p(1) \qquad (5.55)$$

Using (5.55) and Assumption 5.4, it can be seen that (5.50) implies

$$R(\theta_0)T^{1/2}(\tilde\theta_T-\theta_0) + o_p(1) = 0 \qquad (5.56)$$


Taken together, equations (5.54) and (5.56) imply that $T^{1/2}(\tilde\theta_T-\theta_0)$ satisfies
the following set of equations,

$$\begin{bmatrix} 0 \\ 0 \end{bmatrix} = \begin{bmatrix} G_0'S^{-1}T^{1/2}g_T(\theta_0) \\ 0 \end{bmatrix} + \begin{bmatrix} G_0'S^{-1}G_0 & -R_0' \\ R_0 & 0 \end{bmatrix} \begin{bmatrix} T^{1/2}(\tilde\theta_T-\theta_0) \\ T^{1/2}\tilde\rho_T \end{bmatrix} + o_p(1) \qquad (5.57)$$

where for brevity we set $R_0 = R(\theta_0)$. Using the formulae for the inversion of a
partitioned matrix,³⁴ it can be shown that (5.57) implies

$$T^{1/2}(\tilde\theta_T-\theta_0) = -\{V_U - V_UR_0'[R_0V_UR_0']^{-1}R_0V_U\}G_0'S^{-1}T^{1/2}g_T(\theta_0) + o_p(1) \qquad (5.58)$$

where to simplify the formulae we have set $V_U = (G_0'S^{-1}G_0)^{-1}$; this notation
reflects the fact that this matrix is the variance of the asymptotic distribution for
the unrestricted estimator; see Theorem 3.2. Notice that (5.58) has essentially
the same structure as appeared at this stage in the analysis of the unrestricted
estimator in Section 3.4.2: a matrix of constants times the vector, $T^{1/2}g_T(\theta_0)$.
So once again, the limiting distribution is normal.
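
The step from the stacked system to (5.58) can be verified numerically. The sketch below is an illustration only: it uses randomly generated stand-ins for $G_0$, $S$, $R(\theta_0)$ and $T^{1/2}g_T(\theta_0)$, ignores the $o_p(1)$ terms, solves the linear system formed from (5.54) and (5.56) directly, and compares the solution with the closed form in (5.58).

```python
import numpy as np

rng = np.random.default_rng(1)
q, p, s = 6, 3, 1                          # hypothetical dimensions
G0 = rng.standard_normal((q, p))           # stands in for G_0
A = rng.standard_normal((q, q))
S = A @ A.T + q * np.eye(q)                # a positive definite stand-in for S
R0 = rng.standard_normal((s, p))           # stands in for R(theta_0)
z = rng.standard_normal(q)                 # stands in for T^{1/2} g_T(theta_0)

S_inv = np.linalg.inv(S)
V_U = np.linalg.inv(G0.T @ S_inv @ G0)

# Stack (5.54) and (5.56) as one linear system in (d, rho)
M = np.block([[G0.T @ S_inv @ G0, -R0.T], [R0, np.zeros((s, s))]])
rhs = -np.concatenate([G0.T @ S_inv @ z, np.zeros(s)])
d_system = np.linalg.solve(M, rhs)[:p]

# Closed form on the right hand side of (5.58)
middle = np.linalg.inv(R0 @ V_U @ R0.T)
d_closed = -(V_U - V_U @ R0.T @ middle @ R0 @ V_U) @ G0.T @ S_inv @ z

print(np.allclose(d_system, d_closed))
```

The closed form also satisfies the linearized restriction, since premultiplying it by the stand-in for $R(\theta_0)$ gives a vector of zeros.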


Lemma 5.4 Asymptotic distribution of $T^{1/2}(\tilde\theta_T-\theta_0)$
If (i) Assumptions 3.1-3.5, 3.7-3.13, 5.3 and 5.4 hold; (ii) $\hat S_T \xrightarrow{p} S$; then:
$T^{1/2}(\tilde\theta_T-\theta_0) \xrightarrow{d} N(0, V_R)$ where $V_R = V_U - V_UR_0'(R_0V_UR_0')^{-1}R_0V_U$.


34 See Magnus and Neudecker (1991) [p.11].

Notice that $V_U - V_R$ is a positive semi-definite matrix and so the restricted
estimator is at least as efficient as the unrestricted estimator; in other words,
we are never worse off for imposing valid restrictions on the parameters, as
would be anticipated.

A comparison of Lemma 5.4 and Theorem 3.2 suggests the limiting distributions
of the unrestricted and restricted estimator have much in common.
However, there is one key difference which needs to be brought into the light.
The matrix $V_R$ has rank $p-s$ and so the normal distribution in Lemma 5.4 is
singular, whereas the limiting covariance matrix for the unrestricted estimator
is nonsingular. This difference reflects the nature of the estimators. In the
unrestricted estimation all the elements of $\hat\theta_T$ are “free”. In contrast, only $p-s$
elements of $\tilde\theta_T$ are “free” because the remaining $s$ elements are tied down by
the restrictions.
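
Both properties of the restricted covariance matrix, the rank deficiency and the efficiency ranking, can be illustrated numerically. The sketch below uses arbitrary randomly generated stand-ins for $V_U$ and $R(\theta_0)$; it is a check of the matrix algebra in Lemma 5.4, not an estimation exercise.

```python
import numpy as np

rng = np.random.default_rng(2)
p, s = 4, 2                                # hypothetical numbers of parameters and restrictions
B = rng.standard_normal((p, p))
V_U = B @ B.T + np.eye(p)                  # positive definite stand-in for V_U
R0 = rng.standard_normal((s, p))           # stands in for R(theta_0)

# V_R as defined in Lemma 5.4
V_R = V_U - V_U @ R0.T @ np.linalg.inv(R0 @ V_U @ R0.T) @ R0 @ V_U

rank_V_R = np.linalg.matrix_rank(V_R)
min_eig_diff = np.linalg.eigvalsh(V_U - V_R).min()   # smallest eigenvalue of V_U - V_R
print(rank_V_R, min_eig_diff >= -1e-10)
```

The rank equals $p-s$, and the smallest eigenvalue of the difference is zero up to rounding error, in line with positive semi-definiteness.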

This analysis has concentrated on the estimator under $H_0^R$. In Section 5.3,
certain comments are made about the behaviour of the restricted estimator under
local alternatives $H_A^R$. Newey and McFadden (1994) [p.2218-20] present a
more general version of our analysis under this more general class of processes.
Just as in Section 5.1.3, the only effective difference between $H_0^R$ and $H_A^R$
appears in the mean of the asymptotic distribution. Therefore, both Lemmas
5.2 and 5.3 continue to hold under local alternatives.


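II. Proof of Theorem 5.6
Part (a): Under the assumptions listed in condition (i) of the theorem, it follows
that $R(\hat\theta_T) \xrightarrow{p} R(\theta_0)$ and $G_T(\hat\theta_T) \xrightarrow{p} G_0$. These two results combined with
condition (ii) of the theorem imply that

$$W_T = T^{1/2}r(\hat\theta_T)'[R(\theta_0)(G_0'S^{-1}G_0)^{-1}R(\theta_0)']^{-1}T^{1/2}r(\hat\theta_T) + o_p(1)$$

Therefore, the result will be established if we can show that $T^{1/2}r(\hat\theta_T) = -n_T + o_p(1)$.
To this end, we use the Mean Value Theorem to deduce that

$$T^{1/2}r(\hat\theta_T) = T^{1/2}r(\theta_0) + R(\hat\theta_T,\theta_0,\lambda_T)T^{1/2}(\hat\theta_T-\theta_0) \qquad (5.59)$$

where $R(\hat\theta_T,\theta_0,\lambda_T)$ is an $(s\times p)$ matrix whose $i$th row is the $i$th row of $R(\bar\theta_T^{(i)})$
where $\bar\theta_T^{(i)} = \lambda_{T,i}\theta_0 + (1-\lambda_{T,i})\hat\theta_T$ for some $0 < \lambda_{T,i} < 1$, and $\lambda_T$ is the $(s\times 1)$
vector with $i$th element $\lambda_{T,i}$. Since $\hat\theta_T \xrightarrow{p} \theta_0$, it follows that $\bar\theta_T^{(i)} \xrightarrow{p} \theta_0$ and so
$R(\hat\theta_T,\theta_0,\lambda_T) \xrightarrow{p} R(\theta_0)$. Using this result and $r(\theta_0) = 0$ in (5.59), it follows that

$$T^{1/2}r(\hat\theta_T) = R(\theta_0)T^{1/2}(\hat\theta_T-\theta_0) + o_p(1) \qquad (5.60)$$

Equation (3.26) implies that

$$T^{1/2}(\hat\theta_T-\theta_0) = -(G_0'S^{-1}G_0)^{-1}G_0'S^{-1}T^{1/2}g_T(\theta_0) + o_p(1) \qquad (5.61)$$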

Finally, the substitution of (5.61) into (5.60) yields $T^{1/2}r(\hat\theta_T) = -n_T + o_p(1)$,
which completes the proof of (a).

Part (b): Lemma 5.2 establishes that $\tilde\theta_T \xrightarrow{p} \theta_0$, and under the stated conditions
$G_T(\tilde\theta_T) \xrightarrow{p} G_0$. The second of these results can be combined with the consistency
of $\hat S_T$ to deduce that $LM_T = \overline{LM}_T + o_p(1)$ where

$$\overline{LM}_T = T^{1/2}g_T(\tilde\theta_T)'S^{-1}G_0V_UG_0'S^{-1}T^{1/2}g_T(\tilde\theta_T) \qquad (5.62)$$

where $V_U = (G_0'S^{-1}G_0)^{-1}$. So the desired result will be established if we can
show that $\overline{LM}_T = N_T + o_p(1)$. To this end, we now consider the limiting
behaviour of $G_0'S^{-1}T^{1/2}g_T(\tilde\theta_T)$. Using (5.53), it follows that

$$G_0'S^{-1}T^{1/2}g_T(\tilde\theta_T) = G_0'S^{-1}T^{1/2}g_T(\theta_0) + G_0'S^{-1}G_0T^{1/2}(\tilde\theta_T-\theta_0) + o_p(1) \qquad (5.63)$$

Equation (5.58) provides an asymptotically equivalent expression for $T^{1/2}(\tilde\theta_T-\theta_0)$,
and if this expression is substituted into (5.63) then we obtain

$$G_0'S^{-1}T^{1/2}g_T(\tilde\theta_T) = R_0'[R_0V_UR_0']^{-1}R_0V_UG_0'S^{-1}T^{1/2}g_T(\theta_0) + o_p(1) \qquad (5.64)$$

If (5.64) is substituted into (5.62) then, with appropriate cancellations, we
obtain $\overline{LM}_T = N_T + o_p(1)$.

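Part (c): Once again, the proof rests in part on an application of the Mean
Value Theorem to $T^{1/2}g_T(\tilde\theta_T)$ but this time it is taken around $T^{1/2}g_T(\hat\theta_T)$ to
yield

$$T^{1/2}g_T(\tilde\theta_T) = T^{1/2}g_T(\hat\theta_T) + G_T(\hat\theta_T,\tilde\theta_T,\lambda_T)T^{1/2}(\tilde\theta_T-\hat\theta_T) \qquad (5.65)$$

where $G_T(\hat\theta_T,\tilde\theta_T,\lambda_T)$ is the $(q\times p)$ matrix whose $i$th row is the $i$th row of $G_T(\bar\theta_T^{(i)})$
where (this time) $\bar\theta_T^{(i)} = \lambda_{T,i}\hat\theta_T + (1-\lambda_{T,i})\tilde\theta_T$ for some $0 < \lambda_{T,i} < 1$ and $\lambda_T$ is
the $(q\times 1)$ vector with $i$th element $\lambda_{T,i}$. Since both $\hat\theta_T$ and $\tilde\theta_T$ are consistent, it
follows that $\bar\theta_T^{(i)}$ must also converge in probability to $\theta_0$ for $i = 1,2,\ldots,q$. This
property can then be combined with Assumptions 3.5, 3.12-3.13 to deduce that
$G_T(\hat\theta_T,\tilde\theta_T,\lambda_T) \xrightarrow{p} G_0$ and so that (5.65) implies

$$T^{1/2}g_T(\tilde\theta_T) = T^{1/2}g_T(\hat\theta_T) + G_0T^{1/2}(\tilde\theta_T-\hat\theta_T) + o_p(1) \qquad (5.66)$$

If (5.66) is used to substitute for $T^{1/2}g_T(\tilde\theta_T)$ in (5.45) then it emerges after a
little rearrangement that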

$$D_T = 2T^{1/2}(\tilde\theta_T-\hat\theta_T)'G_0'\hat S_T^{-1}T^{1/2}g_T(\hat\theta_T) + T^{1/2}(\tilde\theta_T-\hat\theta_T)'G_0'\hat S_T^{-1}G_0T^{1/2}(\tilde\theta_T-\hat\theta_T) + o_p(1) \qquad (5.67)$$

Clearly to proceed further, we need an expression for $T^{1/2}(\tilde\theta_T-\hat\theta_T)$. Since

$$T^{1/2}(\tilde\theta_T-\hat\theta_T) = T^{1/2}(\tilde\theta_T-\theta_0) - T^{1/2}(\hat\theta_T-\theta_0)$$

it follows from (5.61) and (5.58) that

$$T^{1/2}(\tilde\theta_T-\hat\theta_T) = V_UR_0'[R_0V_UR_0']^{-1}R_0V_UG_0'S^{-1}T^{1/2}g_T(\theta_0) + o_p(1) \qquad (5.68)$$

We note parenthetically that under our conditions (5.68) implies that

$$T^{1/2}(\tilde\theta_T-\hat\theta_T) \xrightarrow{d} N(0, V_UR_0'[R_0V_UR_0']^{-1}R_0V_U)$$

Using (5.68), we can now deduce the limiting behaviour of the terms on the
right hand side of (5.67) in turn. First consider

$$D_{1T} = 2T^{1/2}(\tilde\theta_T-\hat\theta_T)'G_0'\hat S_T^{-1}T^{1/2}g_T(\hat\theta_T)$$

The first order conditions for the unrestricted estimation, (3.12), imply that

$$G_T(\hat\theta_T)'\hat S_T^{-1}T^{1/2}g_T(\hat\theta_T) = 0 \qquad (5.69)$$

Since $G_T(\hat\theta_T) \xrightarrow{p} G_0$, it follows from (5.69) that $G_0'\hat S_T^{-1}T^{1/2}g_T(\hat\theta_T) = o_p(1)$.
Furthermore, (5.68) implies $T^{1/2}(\tilde\theta_T-\hat\theta_T) = O_p(1)$. Therefore we can combine
these two order in probability statements to deduce $D_{1T} = O_p(1)o_p(1) = o_p(1)$.
Now consider the second term on the right hand side
of (5.67), namely

$$D_{2T} = T^{1/2}(\tilde\theta_T-\hat\theta_T)'G_0'\hat S_T^{-1}G_0T^{1/2}(\tilde\theta_T-\hat\theta_T) \qquad (5.70)$$

It follows from the consistency of $\hat S_T$ and (5.68) that $D_{2T} = N_T + o_p(1)$. Therefore
$D_T = D_{1T} + D_{2T} + o_p(1) = N_T + o_p(1)$. □
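
The cancellations used in parts (a) to (c) can be confirmed with a short numerical check. The sketch below makes two assumptions that are not spelled out in this part of the text: it takes $n_T = R_0V_UG_0'S^{-1}T^{1/2}g_T(\theta_0)$ and $N_T = n_T'[R_0V_UR_0']^{-1}n_T$, which is the form implied by (5.60)-(5.61) and by the Wald statistic. All matrices are random stand-ins and the $o_p(1)$ terms are ignored.

```python
import numpy as np

rng = np.random.default_rng(3)
q, p, s = 6, 3, 2                              # hypothetical dimensions
G0 = rng.standard_normal((q, p))               # stands in for G_0
A = rng.standard_normal((q, q))
S_inv = np.linalg.inv(A @ A.T + q * np.eye(q)) # inverse of a positive definite stand-in for S
R0 = rng.standard_normal((s, p))               # stands in for R(theta_0)
z = rng.standard_normal(q)                     # stands in for T^{1/2} g_T(theta_0)

V_U = np.linalg.inv(G0.T @ S_inv @ G0)
mid = np.linalg.inv(R0 @ V_U @ R0.T)
n = R0 @ V_U @ G0.T @ S_inv @ z                # assumed form of n_T
N = n @ mid @ n                                # assumed form of N_T

wald = (-n) @ mid @ (-n)                       # part (a): quadratic form in -n_T
score = R0.T @ mid @ n                         # right hand side of (5.64)
lm = score @ V_U @ score                       # part (b): plug (5.64) into (5.62)
diff = V_U @ R0.T @ mid @ n                    # right hand side of (5.68)
d2 = diff @ np.linalg.inv(V_U) @ diff          # part (c): plug (5.68) into (5.70)
print(np.allclose([wald, lm, d2], N))
```

All three leading terms reduce to the same quadratic form, which is the content of the asymptotic equivalence.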


5.4 Testing Hypotheses About Structural 
Stability 


So far, it has been assumed that if Assumption 3.3 is violated then the value
of $E[f(v_t,\theta_0)]$ is the same for all $t$ (albeit for a given $T$ in the case of local
misspecification). This property is referred to as structural stability. However,
Assumption 3.3 is also violated if $E[f(v_t,\theta_0)] \neq 0$ for only part of the sample;
such behaviour is termed structural instability. This section reviews various
methods for testing structural stability based on GMM estimators.

The null hypothesis for structural stability tests is very simple: it states that 
Assumption 3.3 holds throughout the sample. The alternative is more difficult, 
however, because it must specify how the model changes. In the GMM litera- 
ture, attention has focused almost exclusively on the case where the instability 
involves a discrete change at a single point in the sample known as the “break 
point”. So this scenario receives the most attention here. However, we briefly 
discuss other forms of instability at the end of the section. To present the null 
and alternative hypotheses, it is necessary to introduce the following notation. 
Let $\pi$ be a constant defined on (0,1) and let $\pi T$ denote the potential break
point at which some aspect of the model changes. For our purposes here, it is
convenient to divide the original sample into two sub-samples. Sub-sample 1
consists of the observations before the break point, namely $\mathcal{T}_1 = \{1,2,\ldots,[\pi T]\}$,
where [.] denotes the integer part, and sub-sample 2 consists of the observations
after the break point, $\mathcal{T}_2 = \{[\pi T]+1,\ldots,T\}$. This break point may be treated
as known or unknown in the construction of the tests. If it is known, then the 
break point is specified a priori by the researcher and it is only desired to test for 
instability at this point alone. For example, we investigate below whether the 
change in operating procedures by the Federal Reserve in October 1979 caused 
instability in Hansen and Singleton’s (1982) consumption based asset pricing 
models. If the break point is unknown, then the null is the broader hypothesis 
that there is no instability at any point in the sample. It is easily imagined that 
tests for the two cases are closely related. We begin our discussion with the 
simpler case in which the break point is known because this provides a more 
convenient setting for introducing the null hypotheses and the test statistics. 
We then consider the extension of these techniques to the unknown break point 
case. 
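
The division of the sample at a candidate break point can be sketched in a few lines. The function name and the example values of $T$ and $\pi$ below are illustrative choices; the integer part $[\pi T]$ is computed with `math.floor`.

```python
import math

def split_sample(T, pi):
    """Return the 1-based observation indices of the two sub-samples."""
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must lie strictly between 0 and 1")
    k = math.floor(pi * T)                 # [pi T], the integer part
    return list(range(1, k + 1)), list(range(k + 1, T + 1))

first, second = split_sample(T=10, pi=0.35)
print(first, second)
```

With $T = 10$ and $\pi = 0.35$ the first sub-sample holds observations 1 to 3 and the second holds observations 4 to 10.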


5.4.1 Known Break Point Case 


As remarked above, the basic null hypothesis of structural stability is very 
straightforward, namely 


$$H_0^{SS}(\pi):\quad E[f(v_t,\theta_0)] = 0 \quad \text{for all } t \in \mathcal{T}_1\cup\mathcal{T}_2$$


However, rather than work directly with $H_0^{SS}(\pi)$, it is useful to decompose this
hypothesis into statements about the stability of the identifying and overidentifying
restrictions. It can be recalled from Section 3.3 that these two sets of
restrictions play different roles in the estimation, and we have already seen in
this chapter that these roles are reflected in the types of inference question for
which each is used. The identifying restrictions are imposed in estimation, and
so underlie hypotheses about $\theta_0$. The overidentifying restrictions are ignored
in estimation, and so can form the basis for inference about the validity of
the model specification. It emerges below that similar connections arise in the
context of structural stability testing, and this leads to valuable model building
information. It is therefore useful to decompose $H_0^{SS}(\pi)$ to reflect these two
possible sources of instability, and develop a test for each.

To introduce these component null hypotheses, it is necessary to allow for
the possibility that the data generation process for $v_t$ is different in $\mathcal{T}_1$ and
$\mathcal{T}_2$. Accordingly, let $E_i[.]$ and $Var_i[.]$ denote the expectation and variance
operators relative to the data generation process for $v_t$ in $\mathcal{T}_i$. Furthermore,
we define the following sub-sample analogs to $P(\theta)$, $F(\theta)$ and $S$:
$P_i(\theta,\pi) = F_i(\theta,\pi)[F_i(\theta,\pi)'F_i(\theta,\pi)]^{-1}F_i(\theta,\pi)'$,
$F_i(\theta,\pi) = \{S_i(\theta,\pi)\}^{-1/2}E_i[\partial f(v_t,\theta)/\partial\theta']$,

$$S_1(\theta_1,\pi) = \lim_{T\to\infty} Var_1\Big[[\pi T]^{-1/2}\sum_{t=1}^{[\pi T]} f(v_t,\theta_1)\Big]$$

$$S_2(\theta_2,\pi) = \lim_{T\to\infty} Var_2\Big[(T-[\pi T])^{-1/2}\sum_{t=[\pi T]+1}^{T} f(v_t,\theta_2)\Big]$$

Since the identifying restrictions are imposed in estimation, there are always
parameter values which satisfy them in each of the two sub-samples. Therefore,
the identifying restrictions are said to be structurally stable if they are satisfied
by the same parameter value in each sub-sample. This null is formally stated
as


$$\begin{aligned} H_0^I(\pi):\quad & P_1(\theta_0,\pi)\{S_1(\theta_0,\pi)\}^{-1/2}E_1[f(v_t,\theta_0)] = 0, & t\in\mathcal{T}_1 \\ & P_2(\theta_0,\pi)\{S_2(\theta_0,\pi)\}^{-1/2}E_2[f(v_t,\theta_0)] = 0, & t\in\mathcal{T}_2 \end{aligned}$$

In contrast, the overidentifying restrictions are ignored in estimation and so we
can examine their stability directly. The overidentifying restrictions are said to
be stable if they hold before and after the break point. This is formally stated
as

$$H_0^O(\pi) = H_0^{O1}(\pi)\ \&\ H_0^{O2}(\pi)$$

where

$$H_0^{O1}(\pi):\quad [I_q - P_1(\theta_1,\pi)]\{S_1(\theta_1,\pi)\}^{-1/2}E_1[f(v_t,\theta_1)] = 0, \quad t\in\mathcal{T}_1$$
$$H_0^{O2}(\pi):\quad [I_q - P_2(\theta_2,\pi)]\{S_2(\theta_2,\pi)\}^{-1/2}E_2[f(v_t,\theta_2)] = 0, \quad t\in\mathcal{T}_2$$

Notice that $H_0^{O1}(\pi)$ and $H_0^{O2}(\pi)$ allow for the possibility that the overidentifying
restrictions are satisfied at different parameter values in each sub-sample.

By the very nature of the decomposition, it is clear that any instability must
be reflected in a violation of at least one of the hypotheses $H_0^I(\pi)$ and $H_0^O(\pi)$.
Therefore it follows that

$$H_0^{SS}(\pi) = H_0^I(\pi)\ \&\ H_0^O(\pi)$$

The value of this decomposition is that it allows the researcher to discriminate
between two scenarios of empirical interest. The first is one in which the instability
is confined to the parameters alone; this case is consistent with a violation
of $H_0^I(\pi)$ but the validity of $H_0^O(\pi)$. The second scenario is one in which the
instability is not confined to the parameters alone but affects other aspects of
the model; this would imply a violation of $H_0^O(\pi)$ and most likely $H_0^I(\pi)$ as well.
We now describe test statistics for each component, and then present their
asymptotic properties. To this end, we introduce the following notation and an
additional assumption. Let the sample moment in each sub-sample be

$$g_{1,T}(\theta;\pi) = [\pi T]^{-1}\sum_{t=1}^{[\pi T]} f(v_t,\theta)$$

$$g_{2,T}(\theta;\pi) = (T-[\pi T])^{-1}\sum_{t=[\pi T]+1}^{T} f(v_t,\theta)$$

and $\hat S_{1,T}(\pi)$, $\hat S_{2,T}(\pi)$ be consistent estimators of $S_1(\theta_1,\pi)$, $S_2(\theta_2,\pi)$ respectively.
With these definitions, the sub-sample two step GMM estimators are

$$\hat\theta_{i,T}(\pi) = \arg\min_{\theta\in\Theta}\, g_{i,T}(\theta;\pi)'\hat S_{i,T}(\pi)^{-1}g_{i,T}(\theta;\pi) \qquad (5.71)$$

for $i = 1,2$. We also need the sub-sample derivative matrices,

$$G_{1,T}(\theta;\pi) = [\pi T]^{-1}\sum_{t=1}^{[\pi T]} \partial f(v_t,\theta)/\partial\theta'$$

$$G_{2,T}(\theta;\pi) = (T-[\pi T])^{-1}\sum_{t=[\pi T]+1}^{T} \partial f(v_t,\theta)/\partial\theta'$$

The additional assumption governs the dependence (or, more appropriately,
the lack of it) between the two sub-samples. Throughout the discussion we
impose the following condition.


Assumption 5.5 Zero Covariance of Partial Sums
$\lim_{T\to\infty} Cov[T^{1/2}g_{1,T}(\theta_0;\pi),\, T^{1/2}g_{2,T}(\theta_0;\pi)] = 0$.


This assumption is not guaranteed under ergodicity but can be justified under 
certain mixing conditions; see Andrews (1993). 
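
As an illustration of the sub-sample moments, the sketch below evaluates them for a hypothetical linear model with moment function $f(v_t,\theta) = z_t(y_t - x_t\theta)$. The model, the simulated data and the size of the parameter shift are all assumptions made for the example and do not come from the text.

```python
import numpy as np

def subsample_moments(y, x, z, theta, pi):
    """Sub-sample averages of f(v_t, theta) = z_t (y_t - x_t theta)."""
    T = len(y)
    k = int(np.floor(pi * T))                       # [pi T]
    f = z * (y - x * theta)[:, None]                # T x q matrix of moment functions
    return f[:k].mean(axis=0), f[k:].mean(axis=0)   # g_{1,T} and g_{2,T}

rng = np.random.default_rng(4)
T, pi = 400, 0.5
z = rng.standard_normal((T, 2))                     # two instruments
x = z @ np.array([1.0, 0.5]) + rng.standard_normal(T)
# The slope moves from 1 to 2 at the break point (a simulated parameter shift)
theta_path = np.where(np.arange(T) < int(pi * T), 1.0, 2.0)
y = x * theta_path + rng.standard_normal(T)

g1, g2 = subsample_moments(y, x, z, theta=1.0, pi=pi)
print(g1, g2)
```

Evaluated at the first-regime parameter value, the first sub-sample moment is close to zero while the second is not, which is the pattern that the stability tests are designed to detect.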

From our earlier discussion, it can be recognized that $H_0^I(\pi)$ is equivalent
to a null hypothesis of no parameter variation. Andrews and Fair (1988) derive
test statistics for the latter hypothesis and it is most convenient to follow their
approach here. Therefore, we introduce the augmented population moment
condition:

$$E[g(v_t,\phi_0)] = E\begin{bmatrix} d_t(\pi)f(v_t,\theta_1) \\ (1-d_t(\pi))f(v_t,\theta_2) \end{bmatrix} = 0 \qquad (5.72)$$

where $d_t(\pi)$ is a dummy variable which equals one when $t \le \pi T$ and $\phi_0 = (\theta_1',\theta_2')'$.
Notice that this population moment condition is more general than
Assumption 3.3 because it allows for the possibility that $E[f(v_t,\theta)] = 0$ is
satisfied at different parameter values before and after the break point. However,
if $\phi_0$ satisfies the restrictions

$$[I_p, -I_p]\phi_0 = 0_p \qquad (5.73)$$

then $\theta_1 = \theta_2$ and so the moment condition is satisfied at the same parameter
value throughout the sample. This structure suggests a straightforward
method for testing $H_0^I(\pi)$: estimate $\phi_0$ by GMM based on (5.72) and then use
the Wald, LM or LR-type statistic from the previous section to test the restrictions
in (5.73). This approach requires calculation of the unrestricted and
restricted estimators of $\phi_0$ denoted by $\hat\phi_{U,T}$ and $\hat\phi_{R,T}$ respectively. The
unrestricted estimator is $\hat\phi_{U,T} = [\hat\theta_{1,T}(\pi)', \hat\theta_{2,T}(\pi)']'$. The restricted estimator is
$\hat\phi_{R,T} = [\tilde\theta_T(\pi)', \tilde\theta_T(\pi)']'$ where


$$\tilde\theta_T(\pi) = \arg\min_{\theta\in\Theta}\sum_{i=1}^{2} g_{i,T}(\theta;\pi)'\hat S_{i,T}(\pi)^{-1}g_{i,T}(\theta;\pi) \qquad (5.74)$$

However, Andrews (1993) shows that $\sup_{\pi\in(0,1)}\|T^{1/2}(\tilde\theta_T(\pi)-\hat\theta_T)\| = o_p(1)$ under
the null hypothesis, where $\hat\theta_T$ is the “full sample” GMM estimator defined in
(3.11). As a consequence, the limiting distribution theory is unaffected by the
use of the full sample GMM estimator in place of the restricted estimator.
Since the full sample estimator has almost certainly been calculated prior to
the implementation of the structural stability tests, there is some convenience to
making this substitution. Therefore, Andrews proposes versions of the LM and
D tests that are based on the full sample estimator and we follow this practice in
our presentation as these versions have become common in applied work. However,
we note in passing that this substitution may have a considerable impact on
the value of the statistic in practice; see Section 9.2 for further discussion in the
context of an empirical example.
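The Wald statistic is given by

$$W_T(\pi) = T[\hat\theta_{1,T}(\pi)-\hat\theta_{2,T}(\pi)]'\,\hat V_W(\pi)^{-1}[\hat\theta_{1,T}(\pi)-\hat\theta_{2,T}(\pi)] \qquad (5.75)$$

where

$$\hat V_W(\pi) = \frac{1}{\pi}\big[G_{1,T}(\hat\theta_{1,T}(\pi);\pi)'\hat S_{1,T}(\pi)^{-1}G_{1,T}(\hat\theta_{1,T}(\pi);\pi)\big]^{-1} + \frac{1}{1-\pi}\big[G_{2,T}(\hat\theta_{2,T}(\pi);\pi)'\hat S_{2,T}(\pi)^{-1}G_{2,T}(\hat\theta_{2,T}(\pi);\pi)\big]^{-1} \qquad (5.76)$$

and $\hat S_{i,T}(\pi)$ denotes a consistent estimator of $S_i(\theta_i,\pi)$ based on the unrestricted
estimator $\hat\theta_{i,T}(\pi)$. The LM statistic is given by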


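$$LM_T(\pi) = \frac{T\pi}{1-\pi}\, g_{1,T}(\hat\theta_T;\pi)'\hat S_T^{-1}G_T(\hat\theta_T)\big[G_T(\hat\theta_T)'\hat S_T^{-1}G_T(\hat\theta_T)\big]^{-1}G_T(\hat\theta_T)'\hat S_T^{-1}g_{1,T}(\hat\theta_T;\pi) \qquad (5.77)$$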
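The D statistic is given by

$$D_T(\pi) = T\big[J(\hat\theta_T,\hat\theta_T;\pi) - J(\hat\theta_{1,T}(\pi),\hat\theta_{2,T}(\pi);\pi)\big] \qquad (5.78)$$

where

$$J(\theta_1,\theta_2;\pi) = \pi\, g_{1,T}(\theta_1;\pi)'\hat S_{1,T}(\pi)^{-1}g_{1,T}(\theta_1;\pi) + (1-\pi)\, g_{2,T}(\theta_2;\pi)'\hat S_{2,T}(\pi)^{-1}g_{2,T}(\theta_2;\pi) \qquad (5.79)$$

To test $H_0^O(\pi)$, Hall and Sen (1999) propose the statistic

$$O_T(\pi) = O_{1,T}(\pi) + O_{2,T}(\pi) \qquad (5.80)$$

where $O_{1,T}(\pi)$ and $O_{2,T}(\pi)$ are the overidentifying restrictions tests based on
the sub-samples $\mathcal{T}_1$ and $\mathcal{T}_2$ respectively, that is

$$O_{1,T}(\pi) = [\pi T]\, g_{1,T}(\hat\theta_{1,T}(\pi);\pi)'\hat S_{1,T}(\pi)^{-1}g_{1,T}(\hat\theta_{1,T}(\pi);\pi) \qquad (5.81)$$
$$O_{2,T}(\pi) = (T-[\pi T])\, g_{2,T}(\hat\theta_{2,T}(\pi);\pi)'\hat S_{2,T}(\pi)^{-1}g_{2,T}(\hat\theta_{2,T}(\pi);\pi) \qquad (5.82)$$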

The following theorem gives the limiting distribution of these statistics. For
brevity, we state the result in terms of the Wald test but the same results apply
to either the LM or D statistics. For convenience, we also state the distributional
results under the composite null $H_0^{SS}(\pi)$.

Theorem 5.8 Limiting Distributions of $W_T(\pi)$ and $O_T(\pi)$ under $H_0^{SS}(\pi)$
If Assumptions 3.1-3.5, 3.8-3.13 and 5.5 hold then: (i) $W_T(\pi) \xrightarrow{d} \chi^2_p$; (ii)
$O_T(\pi) \xrightarrow{d} \chi^2_{2(q-p)}$; (iii) $W_T(\pi)$ and $O_T(\pi)$ are asymptotically independent.


Part (i) is first presented in Andrews and Fair (1988) [Theorem 4], and its
proof has been anticipated in the derivation of the test statistic above. Parts
(ii)-(iii) are presented in Hall and Sen (1999) [Theorem 2.1]. There is a simple
intuition behind part (ii): Theorem 5.1 can be used to justify that $O_{1,T}(\pi)$ and
$O_{2,T}(\pi)$ are individually $\chi^2_{q-p}$ and then Assumption 5.5 implies their asymptotic
independence which gives the stated result. Part (iii) derives from Assumption
5.5 as well as the arguments which underlie Theorem 3.5.³⁵
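
Given sub-sample estimates, the Wald and overidentifying restrictions statistics are straightforward to assemble. The sketch below assumes the reading of (5.76) in which the two sub-sample variance terms are weighted by $1/\pi$ and $1/(1-\pi)$, and it takes the estimates, derivative matrices and covariance estimators as inputs rather than computing them; the function names and the numerical inputs are made up for illustration.

```python
import numpy as np

def wald_break(theta1, theta2, G1, S1, G2, S2, pi, T):
    """W_T(pi) as in (5.75)-(5.76), given sub-sample estimates and matrices."""
    V1 = np.linalg.inv(G1.T @ np.linalg.inv(S1) @ G1)
    V2 = np.linalg.inv(G2.T @ np.linalg.inv(S2) @ G2)
    V_W = V1 / pi + V2 / (1.0 - pi)                 # assumed weighting of the two terms
    d = theta1 - theta2
    return float(T * d @ np.linalg.solve(V_W, d))

def overid_break(g1, S1, g2, S2, pi, T):
    """O_T(pi) = O_{1,T}(pi) + O_{2,T}(pi) as in (5.80)-(5.82)."""
    k = int(np.floor(pi * T))                       # [pi T]
    O1 = k * g1 @ np.linalg.solve(S1, g1)
    O2 = (T - k) * g2 @ np.linalg.solve(S2, g2)
    return float(O1 + O2)

# Made-up inputs: one parameter and one moment for the Wald part,
# two moments for the overidentifying restrictions part
one = np.eye(1)
W = wald_break(np.array([1.2]), np.array([1.0]), one, one, one, one, pi=0.5, T=100)
O = overid_break(np.array([0.1, 0.0]), np.eye(2),
                 np.array([0.0, 0.2]), np.eye(2), pi=0.5, T=100)
print(W, O)
```

In an application the two statistics would be compared with $\chi^2_p$ and $\chi^2_{2(q-p)}$ critical values respectively, as in Theorem 5.8.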

It can be recalled that the decomposition of $H_0^{SS}(\pi)$ was motivated by the
potential to uncover useful information about the source of the instability. To
assess whether this potential is realized, we must explore the behaviour of the
test statistics under an alternative hypothesis which allows for instability. Hall
and Sen (1999) show that $W_T(\pi)$ has power against local alternatives to $H_0^I(\pi)$,
denoted $H_A^I(\pi)$, but none against local alternatives to $H_0^O(\pi)$, denoted $H_A^O(\pi)$.
In contrast, $O_T(\pi)$ has power against $H_A^O(\pi)$ but none against $H_A^I(\pi)$. Furthermore
these two statistics are also asymptotically independent under the composite
local alternative $H_A^I(\pi)\ \&\ H_A^O(\pi)$. These results suggest that the two statistics
can be combined to discriminate between local instability which is due solely to
parameter variation and local instability of a more general nature. Interestingly,
Hall and Sen (1999) show that this conclusion holds even if the wrong break
point is used in the calculation of the tests.³⁶ However, the same conclusion
only holds for non-local alternatives if the correct break point is used. We return
to this issue when we describe the extension of these statistics to the unknown
break point case in the next sub-section.

At the conclusion of this sub-section, we illustrate the tests for the Hansen 
and Singleton’s (1982) consumption based asset pricing model. However, before 
that, we briefly describe two other statistics which could be used to test for 
instability. These are the overidentifying restrictions test and the Predictive 
test. 

Since the overidentifying restrictions test is the standard diagnostic for model
specification, it is interesting to consider its properties against structural instability.
Ghysels and Hall (1990a) show that $J_T$ is insensitive to $H_A^I(\pi)$ and Sen
(1997) shows that it has power against $H_A^O(\pi)$. The arguments behind each are
essentially the same as those used to establish that $J_T$ has power against $H_A^O$
but none against $H_A^I$ in Section 5.1.3. Hall, Inoue, and Peixe (2003) consider
the limiting behaviour of $J_T$ in the presence of non-local misspecification due
to neglected structural instability. They provide conditions for the test to be
consistent but show that these are not guaranteed to hold in all circumstances.


35 It should be noted that Theorem 5.8 (i) only requires $H_0^I(\pi)$ to hold, and part (ii)
only requires $H_0^O(\pi)$ to hold, provided also that the other regularity conditions are suitably
modified; see Hall and Sen (1999).

36 In other words, the test is calculated with $\pi = \pi_*$, say, but the true break point is $[\pi_0T]$.

This is because while there may be no single value of $\theta$ that satisfies the population
moment condition for every observation, there can be a value of $\theta$ that
sets the average of these population moment conditions to zero. While noteworthy,
such a scenario is likely to be the exception rather than the rule. So
for practical purposes, it is reasonable to conclude that the overidentifying
restrictions test can detect neglected structural instability in many settings. In
spite of these properties, intuition suggests that $W_T(\pi)$ and $O_T(\pi)$ are likely to
be more powerful tests than $J_T$ against structural instability because they are
specifically designed for that alternative. Simulation evidence reported in Sen
(1997) supports this view.

Ghysels and Hall (1990c) proposed the Predictive test to discriminate between
$H_0^{SS}(\pi)$ and the alternative hypothesis

$$H_A^{PR}(\pi):\quad E_1[f(v_t,\theta_0)] = 0,\ t\in\mathcal{T}_1 \quad\text{and}\quad E_2[f(v_t,\theta_0)] \neq 0,\ t\in\mathcal{T}_2$$

The statistic is based on evaluating the sample moments from $\mathcal{T}_2$ at $\hat\theta_{1,T}(\pi)$.
Under $H_0^{SS}(\pi)$, this estimated sample moment should converge in probability to
zero. This approach leads to the Predictive test statistic

$$PR_T(\pi) = T\, g_{2,T}(\hat\theta_{1,T}(\pi);\pi)'\,\hat V_{PR}^{-1}\, g_{2,T}(\hat\theta_{1,T}(\pi);\pi)$$

where $\hat V_{PR}$ is a covariance matrix defined in Ghysels and Hall (1990c). Ghysels
and Hall (1990c) show that this statistic converges to a $\chi^2_q$ distribution under
$H_0^{SS}(\pi)$.³⁷ Ghysels, Guay, and Hall (1997) show that


$$H_A^{PR}(\pi) = H_A^I(\pi)\ \&\ H_0^{O1}(\pi)\ \&\ H_A^{O2}(\pi)$$


In other words, the Predictive test has no power against violations of $H_0^{O1}(\pi)$.
This feature renders the Predictive test less attractive than the combined use
of $W_T(\pi)$ (or $LM_T(\pi)$, $D_T(\pi)$) and $O_T(\pi)$ described above and so we do not
pursue it further here.³⁸


Example: Hansen and Singleton’s (1982) Consumption Based Asset 
Pricing Model 

Since our sample spans five decades there are many events which may have 
caused structural instability in an asset pricing model. To illustrate the tests 
described above, we pick one such event: the change in the operating procedures 
of the Federal Reserve in October, 1979. During the 1960s and most of the 1970s, 
the Federal Reserve used the federal funds rate as its primary operating target 
for monetary policy.³⁹ In October 1979, it was decided to change this practice to
one in which the level of non-borrowed reserves became the primary operating
target.⁴⁰ It has been argued in the literature that this change in Fed policy may
have had sufficient impact on the financial environment to cause instability in
asset pricing models.⁴¹

37 Ghysels and Hall (1990a) propose a structural stability test based along a similar principle
to the Eichenbaum, Hansen, and Singleton (1988) statistic in Section 5.2, but Ahn (1995)
shows this is asymptotically equivalent to the Predictive test; their finite sample properties
may be different, however.

38 Ghysels, Guay, and Hall (1997) also extend the Predictive test to the unknown break
point case.

39 The federal funds rate is the interest rate on funds loaned overnight between banks.


The evidence from the overidentifying restrictions test suggests that this
model may be correctly specified for value weighted returns (VWR), but
misspecified for equally weighted returns (EWR). Both conclusions leave scope for
the use of structural stability tests, although for different reasons. Recall from
above that the overidentifying restrictions test has power against structural
instability but is anticipated to be less powerful than tests specifically
designed for this alternative. So for VWR, the motivation is that the failure to
reject with the overidentifying restrictions test may simply reflect the low power
of the test against structural instability. For EWR, the motivation is
to assess whether the significance of the overidentifying restrictions test can be
attributed to structural instability. Table 5.4 reports the structural stability
test statistics associated with the October 1979 break point. For brevity, we
only report results based on $W_T^{(1)} = (T^{-1}\sum_{t=1}^{T} z_t z_t')^{-1}$ and $\hat S_T = \hat S_{SU}$ given in
(3.40).

For VWR, the overidentifying restrictions based tests are all insignificant
at the 10% level. However, the evidence from the parameter variation tests is
mixed. The Wald and LM tests are insignificant but the D test is just significant
at the 10% level. Unfortunately, there is no obvious way to interpret this
discrepancy between the tests of parameter variation. Statistical theory tells us
only that $W_T(\pi)$, $LM_T(\pi)$ and $D_T(\pi)$ are asymptotically equivalent under the
null and local alternatives but this does not imply the tests need be numerically
identical in finite samples. However, one possible explanation is that $D_T(\pi)$
is calculated using the full sample GMM estimator as the "restricted estimator".
While this substitution is asymptotically valid, it may inflate the value of the
statistic because it follows from (5.74) that $J(\hat\theta_T,\hat\theta_T;\pi) \geq J(\tilde\theta_T,\tilde\theta_T;\pi)$.42


For EWR, the evidence is more clear cut. All the parameter variation tests
are insignificant at the 10% level, but the overidentifying restrictions based tests
indicate instability. Both $O_T(\pi)$ and $O2_T(\pi)$ are significant at the 10% level,
but $O1_T(\pi)$ is insignificant. This pattern of results suggests the model
specification is correct prior to 1979:9, but misspecified thereafter.43 Provided we
accept the general framework of the consumption based asset pricing model, the
most logical source of this misspecification is the representative agent's utility
function. So with this proviso, the evidence is consistent with the following
scenario. The representative agent possesses a CRRA utility function for the period
1959:3-1979:9, but then the functional form of this utility function changes as
a result of some event in 1979:10.44 However, there is one important caveat.
Although we have selected this break point for a reason, all these results may be
sensitive to the choice of break point and so it is important to conduct a more
thorough investigation before drawing any definitive conclusions about the
importance of this date. This is undertaken at the end of the next subsection.


40 See Mishkin (1995) for a historical review of the Federal Reserve's monetary policy.

41 See inter alia Ghysels and Hall (1990a).

42 See Section 9.2.

43 This conclusion appears at odds with the results reported in Hansen and Singleton (1984)
who report a significant overidentifying restrictions test for the model with EWR. However,
the overidentifying restrictions test based on the pre break sample is sensitive to the choice
of break point; see Section 5.4.2 for further details.


Table 5.4
Structural stability tests associated with October 1979

Asset:                   VWR                     EWR
Test            Statistic   p-value     Statistic   p-value
$W_T(\pi)$        2.640      0.267        1.810      0.405
$LM_T(\pi)$       3.382      0.184        0.888      0.641
$D_T(\pi)$        5.040      0.080        2.543      0.280
$O_T(\pi)$        4.135      0.658       12.031      0.061
$O1_T(\pi)$       1.535      0.674        4.288      0.232
$O2_T(\pi)$       2.601      0.457        7.743      0.052

Note: $W_T(\pi)$, $LM_T(\pi)$ and $D_T(\pi)$ are defined in (5.75), (5.77) and (5.78); $O_T(\pi)$, $O1_T(\pi)$
and $O2_T(\pi)$ are defined in (5.80), (5.81) and (5.82).
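
The p-values in Table 5.4 can be checked against the chi-square distribution with a few lines of code. The sketch below is only illustrative: the function name `chi2_sf` is hypothetical, and the degrees of freedom used in the usage note (2 for the parameter variation tests, 6 for $O_T(\pi)$ and 3 for $O1_T(\pi)$ and $O2_T(\pi)$) are an assumption inferred from the reported p-values rather than values stated in this section.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for X ~ chi-square(df), df a positive integer."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)               # Q(1, h) = exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))    # Q(1/2, h) = erfc(sqrt(h))
    # Recurrence for the regularized upper incomplete gamma function:
    # Q(a + 1, h) = Q(a, h) + h^a exp(-h) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q
```

Under the assumed degrees of freedom, `chi2_sf(2.640, 2)`, `chi2_sf(4.135, 6)` and `chi2_sf(1.535, 3)` agree with the VWR p-values reported for $W_T(\pi)$, $O_T(\pi)$ and $O1_T(\pi)$ to three decimal places.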


5.4.2 Unknown Break Point Case 


If the break point is unknown, then the aim is to test whether there is evidence
of instability at any point in the sample. However, in practice, it is necessary
to limit attention to the null hypothesis:

$$H_0^{SS}(\Pi) = H_0^{SS}(\pi), \ \text{for all } \pi \in \Pi \subset (0,1) \tag{5.83}$$

On one hand, it is desirable for $\Pi$ to be as wide as possible so that the null is as
broad as possible. On the other hand, it must not be so wide that asymptotic
theory is a poor approximation in the sub-samples. In applications to models
of economic time series, it has become customary to use $\Pi = [0.15, 0.85]$. As
in the fixed break point case, we decompose the null into components involving
the stability of the identifying and overidentifying restrictions, that is

$$H_0^{SS}(\Pi) = H_0^{I}(\Pi)\ \&\ H_0^{O}(\Pi) \tag{5.84}$$

where

$$H_0^{I}(\Pi) = H_0^{I}(\pi), \ \text{for all } \pi \in \Pi \tag{5.85}$$

$$H_0^{O}(\Pi) = H_0^{O}(\pi), \ \text{for all } \pi \in \Pi \tag{5.86}$$

44 See Sen and Hall (1999) for further discussion.

We begin by describing statistics for testing $H_0^I(\Pi)$. The construction is
a natural extension of the fixed break point methods. Now $W_T(\pi)$, say, is
calculated for each possible $\pi$ to produce a sequence of statistics indexed by $\pi$,
and inference is based on some function of this sequence. This function is chosen
to maximize power against a local alternative in which a weighting distribution
is used to indicate the relative importance of departures from $H_0^I(\pi)$ in different
directions at different break points. A general framework for the derivation of
these optimal tests is provided by Andrews and Ploberger (1994) in the context
of Maximum Likelihood estimators and this is extended to the GMM framework
by Sowell (1996). One drawback with this approach is that a different choice
of weighting distribution leads to a different optimal statistic; however, three
choices have received particular attention. To facilitate their presentation, we
define the following local alternative to $H_0^I(\pi)$,

$$H_1^{I}(\pi):\quad P_1(\theta_0;\pi)\{S_1(\theta_0,\pi)\}^{-1/2} E_{1,\pi}[f(v_t,\theta_0)] = T^{-1/2}\mu_{I,1}, \quad t \in T_1$$

$$\qquad\qquad P_2(\theta_0;\pi)\{S_2(\theta_0,\pi)\}^{-1/2} E_{2,\pi}[f(v_t,\theta_0)] = T^{-1/2}\mu_{I,2}, \quad t \in T_2$$

It is assumed that $\mu_{I,1} = 0$ and a weighting distribution is specified for $(\mu_{I,2}, \pi)$.45


The aforementioned three choices are as follows:

Choice 1:

If the conditional weighting distribution of $\mu_{I,2}$ given $\pi$ is of the form $rL(\pi)U$
where $r$ is a scalar, $L(\pi)$ is a particular matrix and $U$ is the uniform distribution
on the unit sphere in $\mathbb{R}^p$ then Andrews and Ploberger (1995) show that for $r$
sufficiently large the optimal statistic is

$$SupW_T = \sup_{\pi \in \Pi} \{W_T(\pi)\}$$

Choices 2 and 3:

If the conditional weighting distribution of $\mu_{I,2}$ given $\pi$ is $N(0, c\Sigma_\pi)$, for some
constant $c$, then Andrews and Ploberger (1994) and Sowell (1996) show that for a
particular choice of $\Sigma_\pi$, the optimal statistic only depends on $c$ and not $\Sigma_\pi$.
So, for convenience, this choice is made and then attention has focused on two
values of $c$. If $c = 0$ then the optimal statistic takes the form

$$AvW_T = \int_\Pi W_T(\pi)\, dJ(\pi)$$

where $J(\pi)$ defines the weighting distribution over $\pi$. If $c = \infty$ then the optimal
statistic takes the form

$$ExpW_T = \log\left\{ \int_\Pi \exp[0.5\, W_T(\pi)]\, dJ(\pi) \right\}$$

In principle, $AvW_T$ and $ExpW_T$ can be calculated with any choice of marginal
distribution for $\pi$. However, it has become customary to assume this distribution
is uniform over $\Pi$.


45 For these tests of parameter variation, the roles of $\mu_{I,1}$ and $\mu_{I,2}$ can be interchanged.

As they stand these statistics are not operational because we have treated $\pi$
as continuous, whereas in practice it is discrete. For a given sample size, the set
of possible break points is $\mathcal{T}_b = \{i/T;\ i = [\pi_L T], [\pi_L T]+1, \ldots, [\pi_U T]\}$ where
$\pi_L$ and $\pi_U$ are respectively the lower and upper endpoints of the closed interval
$\Pi$. So in practice, inference is based on the discrete analogs to $SupW_T$, $AvW_T$
and $ExpW_T$, that is

$$SupW_T = \sup_{i/T \in \mathcal{T}_b} \{W_T(i/T)\} \tag{5.87}$$

$$AvW_T = d(\pi_L,\pi_U,T)^{-1} \sum_{i=[\pi_L T]}^{[\pi_U T]} W_T(i/T) \tag{5.88}$$

$$ExpW_T = \log\left\{ d(\pi_L,\pi_U,T)^{-1} \sum_{i=[\pi_L T]}^{[\pi_U T]} \exp[0.5\, W_T(i/T)] \right\} \tag{5.89}$$

where the last two statistics are specialized to the case in which the weighting
distribution for $\pi$ is uniform on $\Pi$, and $d(\pi_L,\pi_U,T) = [\pi_U T] - [\pi_L T] + 1$.
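
The discrete functionals in (5.87)-(5.89) are simple to compute once the sequence of fixed break point statistics is available. The following is a minimal sketch, not a definitive implementation: the function names `break_indices` and `sup_av_exp` are hypothetical, the statistics are assumed to have been computed elsewhere and passed in as a list, and uniform weights over the candidate break points are assumed.

```python
import math


def break_indices(T, pi_lower=0.15, pi_upper=0.85):
    """Candidate break dates i = [pi_L T], ..., [pi_U T], where [.] is the integer part."""
    eps = 1e-9  # guards against floating-point products landing just below an integer
    lo = math.floor(pi_lower * T + eps)
    hi = math.floor(pi_upper * T + eps)
    return list(range(lo, hi + 1))


def sup_av_exp(stats):
    """Sup, Av and Exp functionals of a sequence of statistics, with uniform weights."""
    d = len(stats)  # plays the role of d(pi_L, pi_U, T)
    sup_stat = max(stats)
    av_stat = sum(stats) / d
    # log of the average of exp(0.5 * W); the largest exponent is factored out
    # first so that large statistics do not overflow math.exp
    m = 0.5 * sup_stat
    exp_stat = m + math.log(sum(math.exp(0.5 * w - m) for w in stats) / d)
    return sup_stat, av_stat, exp_stat
```

For example, with `T = 100` the call `break_indices(100)` returns the 71 dates 15, ..., 85, and `sup_av_exp` applied to the corresponding 71 statistics returns the three functionals in one pass. The same function applies unchanged to sequences of LM, D or overidentifying restrictions statistics.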

Andrews (1993, 2003) and Andrews and Ploberger (1994) derive and tabulate
the limiting distributions of $SupW_T$, $AvW_T$ and $ExpW_T$ under $H_0^I(\Pi)$. We delay a
discussion of the theoretical arguments to the end of this section. Critical points
for these distributions are reproduced here for $\Pi = [0.15, 0.85]$ in Table 5.5.46
These enable the researcher to ascertain whether the statistic is significant at a
prescribed level. Hansen (1997) reports response surfaces which can be used
to calculate p-values for all three versions of these tests. As a reminder, all the
previous remarks equally apply to the corresponding functionals of $LM_T(\pi)$ or
$D_T(\pi)$.


46 Table 5.5 only contains parts of the tabulations reported by Andrews (1993) and Andrews
and Ploberger (1994). They report critical points for $p = 1, 2, \ldots, 20$ and other choices of $\Pi$.

Table 5.5
Critical points for $SupW_T$, $AvW_T$ and $ExpW_T$

Statistic: $SupW_T$

  p      10%       5%       1%
  1     7.12     8.68    12.16
  2    10.00    11.72    15.56
  3    12.28    14.13    18.07
  4    14.34    16.36    20.47
  5    16.30    18.32    22.66
  6    18.11    20.24    24.74
  7    19.87    22.06    26.72
  8    21.55    23.82    28.55
  9    23.20    25.54    30.42
 10    24.80    27.13    32.31
 11    26.38    28.81    33.96
 12    27.90    30.43    35.67

Statistic: $AvW_T$

  p      10%       5%       1%
  1     2.16     2.88     4.72
  2     3.75     4.61     6.73
  3     5.10     6.07     8.21
  4     6.50     7.67    10.18
  5     7.76     9.01    11.32
  6     9.02    10.19    12.93
  7    10.28    11.47    14.34
  8    11.54    12.94    16.14
  9    12.71    14.16    17.30
 10    13.77    15.29    18.72
 11    15.00    16.46    19.44
 12    16.31    17.85    21.03

Statistic: $ExpW_T$

  p      10%       5%       1%
  1     1.51     2.06     3.41
  2     2.59     3.22     4.76
  3     3.49     4.22     5.77
  4     4.37     5.23     7.13
  5     5.22     6.13     7.91
  6     6.01     6.92     8.96
  7     6.70     7.66     9.53
  8     7.58     8.60    10.96
  9     8.31     9.35    11.67
 10     9.00    10.04    12.61
 11     9.69    10.75    13.21
 12    10.45    11.55    13.83

Source: Andrews (2003) [Table 1] and Andrews and Ploberger (1994) [Tables 1 and 2].
Copyright: The Econometric Society. Reproduced with permission.

Notes: the figures represent the critical points for the three tests at the 10%, 5% and 1%
significance level for $\Pi = [0.15, 0.85]$.

The same ideas can be used to construct tests of the null hypothesis that
$H_0^O(\pi)$ holds for all $\pi \in \Pi$ against the alternative that

$$H_1^O(\Pi) = H_1^{O1}(\pi)\ \&\ H_1^{O2}(\pi) \ \text{for all } \pi \in \Pi$$

where

$$H_1^{O1}(\pi):\ [I_q - P_1(\theta_1;\pi)]\{S_1(\theta_1,\pi)\}^{-1/2} E_{1,\pi}[f(v_t,\theta_1)] = T^{-1/2}\mu_{O1}, \quad t \in T_1$$

$$H_1^{O2}(\pi):\ [I_q - P_2(\theta_2;\pi)]\{S_2(\theta_2,\pi)\}^{-1/2} E_{2,\pi}[f(v_t,\theta_2)] = T^{-1/2}\mu_{O2}, \quad t \in T_2$$

Hall and Sen (1999) propose using the following statistics:

$$SupO_T = \sup_{i/T \in \mathcal{T}_b} \{O_T(i/T)\} \tag{5.90}$$

$$AvO_T = d(\pi_L,\pi_U,T)^{-1} \sum_{i=[\pi_L T]}^{[\pi_U T]} O_T(i/T) \tag{5.91}$$

$$ExpO_T = \log\left\{ d(\pi_L,\pi_U,T)^{-1} \sum_{i=[\pi_L T]}^{[\pi_U T]} \exp[0.5\, O_T(i/T)] \right\} \tag{5.92}$$


However, although the functionals are the same, it has proved impossible to
date to deduce any optimality properties for the resulting tests along the lines
described above.47 Hall and Sen (1999) derive and tabulate the limiting
distributions of these three statistics under $H_0^O(\Pi)$. Once again, we postpone a
discussion of the derivation until the end of this sub-section. Critical points for
these distributions are reproduced here in Table 5.6. Sen and Hall (1999) report
response surfaces which can be used to calculate p-values for all three versions
of these tests.


47 It is possible to derive optimal tests against the more restrictive alternatives
$H_0^{O1}(\pi)\ \&\ H_1^{O2}(\pi)$ for all $\pi \in \Pi$ or $H_1^{O1}(\pi)\ \&\ H_0^{O2}(\pi)$ for all $\pi \in \Pi$ but the statistics are
different in each case; see the discussion in Hall and Sen (1999). However, notice that both
these alternatives restrict the violation of the population moment condition to occur either
after or before the break point, whereas, in practice, a researcher typically lacks that kind of
a priori information, and so we do not pursue those tests here.

Table 5.6
Critical points for $SupO_T$, $AvO_T$ and $ExpO_T$

Statistic: $SupO_T$

 q-p     10%       5%
  1     8.70    10.39
  2    12.78    14.75
  3    16.33    18.53
  4    19.65    21.99
  5    22.81    25.31
  6    25.70    28.32
  7    28.76    31.45
  8    31.61    34.53

Statistic: $AvO_T$

 q-p     10%       5%
  1     4.17     5.37
  2     7.21     8.60
  3     9.91    11.54
  4    12.51    14.32
  5    15.01    17.01
  6    17.52    19.68
  7    19.91    22.24
  8    22.44    24.78

Statistic: $ExpO_T$

 q-p     10%       5%
  1     2.45     3.13
  2     4.17     4.99
  3     5.73     6.69
  4     7.20     8.26
  5     8.64     9.77
  6    10.02    11.21
  7    11.41    12.69
  8    12.79    14.12

Source: Hall and Sen (1999) [Table 1]. Copyright 1999 by the American Statistical Association.
Reprinted with permission from the Journal of Business and Economic Statistics.
Notes: the figures represent the critical points for the three tests at the 10%, 5% and 1%
significance level for $\Pi = [0.15, 0.85]$.


Which functional should be used? Simulation evidence suggests that no one
test dominates the others.48 So, unless your priors happen to correspond to one
of the weighting distributions underlying the statistics, it is probably best to
calculate all three, and this seems to have become the most common practice.
However, the Sup test does have one attractive feature not shared by the other
two. If $SupW_T$, say, occurs at $t = t_W$ then $\hat\pi_W = t_W/T$ provides an estimate
of the break point fraction. To date, it is unknown whether this estimate is
consistent for $\pi$ under the alternative but there are grounds for conjecturing
that this property holds in just-identified models at least.49 This remains an
interesting avenue for future research.

48 See Hall and Sen (1999).
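
The break point estimate implied by the Sup statistic is just the location of the maximum of the sequence. A minimal sketch follows, assuming the statistics are stored in date order starting from the first candidate break date; the function name `sup_break_estimate` and its arguments are hypothetical.

```python
def sup_break_estimate(stats, first_index, T):
    """Return (sup statistic, break date t_W, estimated break fraction t_W / T).

    stats[j] is assumed to hold the statistic for break date first_index + j.
    """
    # position of the largest statistic; ties resolve to the earliest date
    k = max(range(len(stats)), key=lambda j: stats[j])
    t_w = first_index + k
    return stats[k], t_w, t_w / T
```

With `T = 100` and candidate dates starting at 15, `sup_break_estimate([1.2, 4.9, 3.3], 15, 100)` picks out the second date, so the estimated break fraction is 16/100.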

Recall that the decomposition of the null hypothesis has been
motivated by its potential to provide useful model building information. In the
previous sub-section, it is argued that this potential is realized for local
instability regardless of the true break point but is only realized for non-local
instability if the correct break point has been identified. The latter property is
underscored by simulation evidence in Hall and Sen (1999) which shows that $SupO_T$,
$AvO_T$ and $ExpO_T$ have power against non-local parameter variation. These properties
prompted Hall and Sen (1999) to propose the following strategy.


Hall and Sen’s (1999) strategy for diagnosing the source of the instability. 


Case 1: If all the unknown break point tests fail to reject then this is evidence that 
all aspects of the model are stable. 


Case 2: If the parameter variation tests are significant and either the unknown
break point overidentifying restriction based tests are insignificant or
$O_T(\hat\pi_W)$ is insignificant, then this is evidence of parameter variation.


Case 3: In all other situations, the tests indicate that there is instability that in- 
volves more than just the parameters. 


Two comments are in order. First, note that the method is premised on
the assumption that $\hat\pi_W$ is consistent for $\pi$ if the instability is confined to the
parameters alone. Secondly, Hall and Sen (1999) propose evaluating the
significance of $O_T(\hat\pi_W)$ using the appropriate critical point of the $\chi^2_{2(q-p)}$
distribution, and this ignores the sampling error associated with the estimation
of the break point. However, they report simulation evidence which suggests
this distributional approximation is reasonably accurate. Their simulation
evidence as a whole suggests that the strategy provides a feasible method for
discriminating between parameter variation alone and more general forms of
instability.
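
The three cases of the strategy can be written as a small decision rule. The sketch below is one possible reading of the cases as stated above, not code from Hall and Sen (1999): the function name and its boolean arguments are hypothetical, and deciding whether a family of tests counts as "significant" is left to the caller.

```python
def diagnose_instability(param_tests_significant,
                         overid_tests_significant,
                         overid_at_estimated_break_significant=None):
    """Return 1, 2 or 3 for the corresponding case of the diagnostic strategy.

    The third argument is the outcome of the overidentifying restrictions test
    evaluated at the estimated break point; None means it was not computed.
    """
    if not param_tests_significant and not overid_tests_significant:
        return 1  # no evidence of instability in any aspect of the model
    if param_tests_significant and (
        not overid_tests_significant
        or overid_at_estimated_break_significant is False
    ):
        return 2  # evidence of parameter variation
    return 3      # instability involves more than just the parameters
```

For instance, insignificant parameter variation tests combined with significant overidentifying restrictions tests give `diagnose_instability(False, True) == 3`.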

One final point should be noted. Although all these tests are designed
against an alternative in which there is instability at a single point in the
sample, all the tests have non-trivial power against other forms of instability. We
do not reproduce the argument here but instead refer the reader to the papers
already cited above.50


Example: Hansen and Singleton's (1982) Consumption Based Asset
Pricing Model51


49 For example, Nunes, Kuan, and Newbold (1995) prove its consistency in linear regression 
models estimated by quasi maximum likelihood. 

50 Also see Section 5.4.3. 

51 See Section 9.2 for another empirical example of these tests.

Table 5.7 reports the structural stability tests when the break point is treated
as unknown. Following accepted practice, we set $\Pi = [0.15, 0.85]$ which means
the potential break point is assumed to lie between 1965:1 and 1992:2. For
brevity, we only report results based on using $W_T^{(1)} = (T^{-1}\sum_{t=1}^{T} z_t z_t')^{-1}$ and
the covariance matrix estimator $\hat S_T = \hat S_{SU}$. Parenthetically, we note that if
$W_T^{(1)} = I_5$ is used then the sub-sample estimates diverge in a few cases.

For VWR, all the statistics are insignificant at the 10% level, and so these
tests provide no evidence of misspecification in this case. For EWR, all the
parameter variation statistics are insignificant at the 10% level, and all the
overidentifying restrictions based tests are significant at the 5% level. This
evidence clearly indicates misspecification, and so is consistent with our earlier
findings based on the overidentifying restrictions test. However, the application
of the structural stability tests provides further information about the nature
of the misspecification. Using Hall and Sen's (1999) diagnostic strategy
described above, the pattern of results suggests that the misspecification cannot
be attributed to parameter variation alone.
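
These significance statements can be reproduced by comparing the statistics in Table 5.7 with the tabulated critical points. The sketch below copies the critical points from the $p = 2$ row of Table 5.5 and the $q - p = 3$ row of Table 5.6; those row choices are an assumption inferred from the p-values reported in Table 5.4, not something stated in this section, and all names in the code are hypothetical.

```python
# Critical points copied from Table 5.5 (row p = 2) and Table 5.6 (row q - p = 3).
CRIT_PARAM = {"Sup": {"10%": 10.00, "5%": 11.72},
              "Av": {"10%": 3.75, "5%": 4.61},
              "Exp": {"10%": 2.59, "5%": 3.22}}
CRIT_OVERID = {"Sup": {"10%": 16.33, "5%": 18.53},
               "Av": {"10%": 9.91, "5%": 11.54},
               "Exp": {"10%": 5.73, "5%": 6.69}}

# (Sup, Av, Exp) statistics copied from Table 5.7.
TABLE_5_7 = {
    "VWR": {"W": (4.899, 1.468, 0.899), "LM": (5.603, 2.095, 1.194),
            "D": (5.852, 2.239, 1.427), "O": (12.759, 5.751, 3.704)},
    "EWR": {"W": (4.893, 1.035, 0.623), "LM": (5.262, 0.896, 0.571),
            "D": (6.909, 1.642, 1.123), "O": (22.580, 13.712, 8.093)},
}


def rejections(asset, level):
    """Map each test to {functional: True if the statistic exceeds its critical point}."""
    out = {}
    for test, values in TABLE_5_7[asset].items():
        crit = CRIT_OVERID if test == "O" else CRIT_PARAM
        out[test] = {name: stat > crit[name][level]
                     for name, stat in zip(("Sup", "Av", "Exp"), values)}
    return out
```

Under these assumptions, `rejections("VWR", "10%")` contains no rejections, while `rejections("EWR", "5%")["O"]` reports a rejection for all three functionals.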

Table 5.7 also reports the dates associated with the supremum of each test.
Two features of these results stand out. First, the supremum for the parameter
variation tests occurs at virtually the same point for a given choice of asset, in
spite of the insignificance of the tests concerned. Secondly, the supremum for
the overidentifying restrictions test occurs at the second possible break point,
that is 1965:2, for each choice of asset. This could reflect instability, but there
is another explanation which needs to be noted. Recall that the
statistical theory behind the tests relies on the applicability of asymptotic theory
in each of the sub-samples. At either end of $\Pi$, one of the sub-samples consists
of only seventy observations, and it may be that this is not sufficiently large for
asymptotic theory to provide a good approximation. In that case, the supremum
may occur close to $\pi = 0.15$ or $\pi = 0.85$ simply because the sequence of test
statistics has not converged in distribution over the entire interval $\Pi$. Figures
5.1-5.2 plot the individual test statistics for $W_T(\pi)$ and $O_T(\pi)$ against $\pi$ for
each choice of asset. The plots for $D_T(\pi)$ and $LM_T(\pi)$ are qualitatively similar
to those for $W_T(\pi)$ and so are omitted for brevity.

Table 5.7
Structural stability tests with unknown break point

VWR:

Test     Sup-       Date       Av-       Exp-
W        4.899     1982:7     1.468     0.899
LM       5.603     1982:9     2.095     1.194
D        5.852     1982:7     2.239     1.427
O       12.759     1965:2     5.751     3.704

EWR:

Test     Sup-       Date       Av-       Exp-
W        4.893     1975:1     1.035     0.623
LM       5.262     1975:1     0.896     0.571
D        6.909     1975:2     1.642     1.123
O       22.580     1965:2    13.712     8.093

Note: W, LM, D and O denote the tests based on $W_T(\pi)$, $LM_T(\pi)$, $D_T(\pi)$ and $O_T(\pi)$ defined
in (5.75), (5.77), (5.78) and (5.80). Date denotes the date associated with the Supremum

Figure 5.1: Wald and overidentifying restrictions tests for structural instability
for the consumption based asset pricing model with value weighted returns

Figure 5.2: Wald and overidentifying restrictions tests for structural instability
for the consumption based asset pricing model with equally weighted returns


5.4.2.1 Technical Details 


There are two main steps to the analysis of the structural stability tests derived
above for the unknown break point case. First, it is necessary to characterize
the limiting behaviour of individual members of the sequence of statistics
$\{W_T(\pi);\ \pi \in \Pi\}$ and $\{O_T(\pi);\ \pi \in \Pi\}$. Secondly, these characterizations are
used to deduce the limiting behaviour of the various functions of the sequences
in which we are interested. The first part is closely related to our earlier analysis
of the statistics for the fixed break point case. However, this time, the results
must apply for all $\pi \in \Pi$, and this requires different techniques and
assumptions. Below it is shown that the limiting distributions of the test statistics
revolve around two continuous time processes known as a Brownian Motion and
a Brownian Bridge. Therefore, we begin with definitions of these processes.


Definition 5.1 Brownian Motion

An $n$-dimensional Brownian Motion $B_n(.)$ is a continuous time process associating
each date $r \in [0,1]$ with the $(n \times 1)$ vector $B_n(r)$ satisfying the following
properties:

(i) $B_n(0) = 0_n$ where $0_n$ is an $(n \times 1)$ vector of zeros.

(ii) For any dates $0 \le r_1 < r_2 < \ldots < r_k \le 1$ the changes
$\{B_n(r_i) - B_n(r_{i-1}),\ i = 2, 3, \ldots, k\}$ are a set of mutually independent random
vectors with $B_n(r_i) - B_n(r_{i-1}) \sim N(0_n, (r_i - r_{i-1}) I_n)$.

(iii) For any given realization, $B_n(r)$ is continuous in $r$ with probability one.

A Brownian motion is the continuous time analog to a random walk, and is
widely used in analyses of diffusion processes.52


Definition 5.2 Brownian Bridge

An $n$-dimensional Brownian Bridge $BB_n(.)$ is a continuous time process associating
each date $r \in [0,1]$ with the $(n \times 1)$ vector $BB_n(r) = B_n(r) - rB_n(1)$
where $B_n(.)$ is a Brownian motion.


Notice that a Brownian bridge both begins and ends at zero.
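
Both definitions can be illustrated on a discrete grid. The sketch below approximates a Brownian motion by a scaled Gaussian random walk and builds the bridge from it using $BB_n(r) = B_n(r) - rB_n(1)$; the function name, the grid size and the seed are hypothetical choices made for illustration only.

```python
import math
import random


def simulate_bm_and_bridge(n_steps, n_dim, seed=0):
    """Approximate B_n(r) and BB_n(r) on the grid r = k / n_steps, k = 0, ..., n_steps."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 / n_steps)  # each increment is N(0, 1 / n_steps) per element
    path = [[0.0] * n_dim]            # property (i): the process starts at zero
    for _ in range(n_steps):
        path.append([x + scale * rng.gauss(0.0, 1.0) for x in path[-1]])
    end = path[-1]                    # approximates B_n(1)
    bridge = [[b - (k / n_steps) * e for b, e in zip(row, end)]
              for k, row in enumerate(path)]
    return path, bridge
```

By construction the first and last rows of `bridge` are zero vectors, which mirrors the remark that a Brownian bridge begins and ends at zero.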

Below we establish that certain statistics converge in distribution to the
distributions possessed by particular functions of Brownian Motions or Bridges.
Such statements require additional notation. Accordingly, we use $a_T \Rightarrow b$ to
denote the statement that $a_T$ converges in distribution to the distribution possessed
by the random variable $b$. More succinctly, $a_T$ is said to weakly converge to $b$.

From (5.75) and (5.80), it is clear that the analysis of $W_T(\pi)$ and $O_T(\pi)$
is going to require assumptions about the partial sum, $T^{-1}\sum_{t=1}^{[\pi T]} f(v_t,\theta_0)$, its
long run variance and the associated derivative matrix. To this end, we assume
the data generation process satisfies the following assumptions.53


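Assumption 5.6 Uniform Convergence of the Variance of the Partial
Sums
$\sup_{\pi \in \Pi} \| Var[T^{-1/2}\sum_{t=1}^{[\pi T]} f(v_t,\theta_0)] - \pi S \| \to 0$.


Assumption 5.7 Uniform Convergence of the Partial Derivative Matrix
$\sup_{\pi \in \Pi} \| T^{-1}\sum_{t=1}^{[\pi T]} E[\partial f(v_t,\theta_0)/\partial\theta'] - \pi G_0 \| \to 0$.


Assumption 5.8 Functional Central Limit Theorem
$S^{-1/2}T^{-1/2}\sum_{t=1}^{[\pi T]} f(v_t,\theta_0) \Rightarrow B_q(\pi)$.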


Notice that Assumption 5.6 implies both that $S_1(\pi) = S_2(\pi) = S$, and also,
together with Assumption 5.7, that $F_1(\theta_0) = F_2(\theta_0) = F(\theta_0)$. The form of the
distribution in Assumption 5.8 can be motivated from

$$T^{-1/2}\sum_{t=1}^{[\pi T]} f(v_t,\theta_0) = \left(\frac{[\pi T]}{T}\right)^{1/2} [\pi T]^{-1/2}\sum_{t=1}^{[\pi T]} f(v_t,\theta_0)$$

by noting that $([\pi T]/T)^{1/2} \approx \pi^{1/2}$, and that the CLT implies
$[\pi T]^{-1/2}\sum_{t=1}^{[\pi T]} f(v_t,\theta_0) \stackrel{d}{\to} N(0, S)$. There is one consequence of Assumption 5.8 which
is worth highlighting. Since

$$T^{-1/2}\sum_{t=1}^{T} f(v_t,\theta_0) = T^{-1/2}\sum_{t=1}^{[\pi T]} f(v_t,\theta_0) + T^{-1/2}\sum_{t=[\pi T]+1}^{T} f(v_t,\theta_0)$$

and Assumption 5.8 implies

$$S^{-1/2}T^{-1/2}\sum_{t=1}^{T} f(v_t,\theta_0) \Rightarrow B_q(1)$$

$$S^{-1/2}T^{-1/2}\sum_{t=1}^{[\pi T]} f(v_t,\theta_0) \Rightarrow B_q(\pi)$$

then it must follow that

$$S^{-1/2}T^{-1/2}\sum_{t=[\pi T]+1}^{T} f(v_t,\theta_0) \Rightarrow B_q(1) - B_q(\pi) \tag{5.93}$$


52 The name derives from the name of the first person to have recorded this type of
uninterrupted irregular motion in a natural phenomenon. R. Brown was a botanist and he observed
the phenomenon when pollen dispersed on water. His results were published in 1828; see
Brown (1828).

53 Recall that for any matrix $A$, $\|A\| = [tr(A'A)]^{1/2}$.

Hamilton (1994) [Sections 17.1-17.3, 18.1] provides a very good introduction to
Brownian Motions and the conditions behind Assumptions 5.6-5.8. Davidson
(1994) [Part IV] provides a more comprehensive treatment.

With these assumptions in place, we now proceed to characterize the limiting
behaviour of $W_T(\pi)$ and $O_T(\pi)$ in terms of Brownian Motions and Brownian
Bridges. Consider first $W_T(\pi)$. The end result was first derived by Andrews
(1993), but we follow the approach taken by Sowell (1996) which exploits the
projection matrix structure inherent in the identifying restrictions.

Since $W_T(\pi)$ depends on $T^{1/2}[\hat\theta_{1,T}(\pi) - \hat\theta_{2,T}(\pi)]$ and

$$T^{1/2}[\hat\theta_{1,T}(\pi) - \hat\theta_{2,T}(\pi)] = T^{1/2}[\hat\theta_{1,T}(\pi) - \theta_0] - T^{1/2}[\hat\theta_{2,T}(\pi) - \theta_0] \tag{5.94}$$

we begin by deriving expressions for $T^{1/2}[\hat\theta_{i,T}(\pi) - \theta_0]$. To facilitate the analysis,
we assume that the GMM estimators based on $T_i$ are consistent for all $\pi$.


Assumption 5.9 Consistency of Sub-Sample Estimators
$\sup_{\pi \in \Pi} \|\hat\theta_{i,T}(\pi) - \theta_0\| \stackrel{p}{\to} 0$.


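We can now repeat exactly the same sequence of arguments as in Section
3.4.2 to obtain the following analogs to (3.26)

$$T^{1/2}[\hat\theta_{i,T}(\pi) - \theta_0] = -M_{i,T}(\pi)\, T^{1/2} g_{i,T}(\theta_0;\pi) \tag{5.95}$$

where

$$M_{i,T}(\pi) = [G_{i,T}(\hat\theta_{i,T}(\pi);\pi)'\hat S_{i,T}(\pi)^{-1} G_{i,T}(\hat\theta_{i,T}(\pi),\theta_0,\lambda_T;\pi)]^{-1} G_{i,T}(\hat\theta_{i,T}(\pi);\pi)'\hat S_{i,T}(\pi)^{-1}$$

and $G_{i,T}(\hat\theta_{i,T}(\pi),\theta_0,\lambda_T;\pi)$ is defined in an analogous fashion to $G_T(\hat\theta_T,\theta_0,\lambda_T)$.
To proceed we adopt the following high level assumption.54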


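54 See Andrews (1993) or Ghysels, Guay, and Hall (1997) for more primitive conditions under
which Assumption 5.10 holds.

Assumption 5.10 Uniform Convergence of $M_{i,T}(\pi)$
$\sup_{\pi \in \Pi} \|M_{i,T}(\pi) - M_0\| \stackrel{p}{\to} 0$ where $M_0 = (G_0'S^{-1}G_0)^{-1}G_0'S^{-1}$.

It then follows from (5.95), Assumptions 5.6-5.10 and (5.93) that

$$T^{1/2}(\hat\theta_{1,T}(\pi) - \theta_0) \Rightarrow -\frac{1}{\pi}[F(\theta_0)'F(\theta_0)]^{-1}F(\theta_0)'B_q(\pi) \tag{5.96}$$

$$T^{1/2}(\hat\theta_{2,T}(\pi) - \theta_0) \Rightarrow -\frac{1}{1-\pi}[F(\theta_0)'F(\theta_0)]^{-1}F(\theta_0)'(B_q(1) - B_q(\pi)) \tag{5.97}$$

where once again we have set $F(\theta_0) = S^{-1/2}G_0$. Since
$-\pi^{-1}B_q(\pi) + (1-\pi)^{-1}[B_q(1) - B_q(\pi)] = -[\pi(1-\pi)]^{-1}[B_q(\pi) - \pi B_q(1)]$ and
$BB_q(\pi) = B_q(\pi) - \pi B_q(1)$, the combination of (5.94) and (5.96)-(5.97) yields

$$T^{1/2}[\hat\theta_{1,T}(\pi) - \hat\theta_{2,T}(\pi)] \Rightarrow -\frac{1}{\pi(1-\pi)}[F(\theta_0)'F(\theta_0)]^{-1}F(\theta_0)'BB_q(\pi) \tag{5.98}$$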


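Now consider $\hat V_W(\pi)$. By similar arguments to above, it can be shown that

$$\hat V_W(\pi) \stackrel{p}{\to} \frac{1}{\pi}[F(\theta_0)'F(\theta_0)]^{-1} + \frac{1}{1-\pi}[F(\theta_0)'F(\theta_0)]^{-1} = \frac{1}{\pi(1-\pi)}[F(\theta_0)'F(\theta_0)]^{-1} \tag{5.99}$$

The combination of (5.98)-(5.99) implies

$$W_T(\pi) \Rightarrow \frac{1}{\pi(1-\pi)}BB_q(\pi)'P(\theta_0)BB_q(\pi) \tag{5.100}$$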


where once again $P(\theta_0) = F(\theta_0)[F(\theta_0)'F(\theta_0)]^{-1}F(\theta_0)'$. Now $P(\theta_0)$ is a projection
matrix with rank equal to $p$ by Assumption 3.6. Therefore $P(\theta_0)$ has $p$
eigenvalues equal to one, $q-p$ eigenvalues equal to zero, and there exists an
orthogonal matrix $H$ such that55

$$P(\theta_0) = H'\Lambda H \tag{5.101}$$

where $\Lambda = diag(1_p, 0_{q-p})$ and $1_p$ is a $(p \times 1)$ vector of ones. If we partition $H$
into $[H_1, H_2]$, where $H_1$ is $q \times p$ then (5.101) implies that

$$P(\theta_0) = \begin{bmatrix} H_1'H_1 & 0 \\ 0 & 0 \end{bmatrix} \tag{5.102}$$

If (5.102) is substituted into (5.100) then it follows that

$$W_T(\pi) \Rightarrow \frac{1}{\pi(1-\pi)}BB_p(\pi)'H_1'H_1BB_p(\pi) \tag{5.103}$$

where $BB_p(\pi)$ denotes the first $p$ elements of $BB_q(\pi)$. Now by definition, $H_i$
are orthogonal matrices and so $H_1'H_1 = I_p$. Therefore, $H_1B_p(\pi) \sim N(0_p, \pi I_p)$
and so it follows that $H_1B_p(\pi) \Rightarrow B_p(\pi)$ and hence that $H_1BB_p(\pi) \Rightarrow BB_p(\pi)$.
This gives us the following result.


55 See Dhrymes (1984) [Propositions 52 and 55, pp. 61 and 65].

Theorem 5.9 Limiting Distribution of $W_T(\pi)$
If Assumptions 3.1-3.5, 3.8-3.9, 5.6-5.10 hold then: $W_T(.) \Rightarrow W(.)$ where
$W(.)$ is a continuous time process associating each date $\pi \in \Pi$ with the scalar
$W(\pi) = \frac{1}{\pi(1-\pi)}BB_p(\pi)'BB_p(\pi)$.

Now consider $O_T(\pi)$. By definition, we have

$$O1_T(\pi) = \| \hat S_{1,T}(\pi)^{-1/2}[\pi T]^{1/2} g_{1,T}(\hat\theta_{1,T}(\pi);\pi) \|^2 \tag{5.104}$$

$$O2_T(\pi) = \| \hat S_{2,T}(\pi)^{-1/2}(T - [\pi T])^{1/2} g_{2,T}(\hat\theta_{2,T}(\pi);\pi) \|^2 \tag{5.105}$$


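and we can repeat the same sequence of arguments as in Section 3.4.3 to deduce
the following sub-sample analogs to (3.35),

$$\hat S_{1,T}(\pi)^{-1/2}[\pi T]^{1/2} g_{1,T}(\hat\theta_{1,T}(\pi);\pi) = N_{1,T}(\pi)\hat S_{1,T}(\pi)^{-1/2}[\pi T]^{1/2} g_{1,T}(\theta_0;\pi) \tag{5.106}$$

$$\hat S_{2,T}(\pi)^{-1/2}(T - [\pi T])^{1/2} g_{2,T}(\hat\theta_{2,T}(\pi);\pi) = N_{2,T}(\pi)\hat S_{2,T}(\pi)^{-1/2}(T - [\pi T])^{1/2} g_{2,T}(\theta_0;\pi) \tag{5.107}$$

where

$$N_{i,T}(\pi) = I_q - \hat S_{i,T}(\pi)^{-1/2}G_{i,T}(\hat\theta_{i,T}(\pi),\theta_0,\lambda_T;\pi)[G_{i,T}(\hat\theta_{i,T}(\pi);\pi)'\hat S_{i,T}(\pi)^{-1}G_{i,T}(\hat\theta_{i,T}(\pi),\theta_0,\lambda_T;\pi)]^{-1}G_{i,T}(\hat\theta_{i,T}(\pi);\pi)'\hat S_{i,T}(\pi)^{-1/2}$$

for $i = 1, 2$. As with $M_{i,T}(\pi)$, we must assume this matrix converges uniformly in
$\pi$.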


Assumption 5.11 Uniform Convergence of $N_{i,T}(\pi)$

$\sup_{\pi \in \Pi} \|N_{i,T}(\pi)\hat S_{i,T}(\pi)^{-1/2} - N_0S^{-1/2}\| \stackrel{p}{\to} 0$ where $N_0 = [I_q - P(\theta_0)]$.


To illustrate the argument from here on, it is most convenient to focus on only
$O1_T(\pi)$, and then to state the corresponding result for $O2_T(\pi)$ afterwards.
Assumptions 5.8 and 5.11 together with (5.106) imply that

$$O1_T(\pi) \Rightarrow \| \pi^{-1/2}[I_q - P(\theta_0)]B_q(\pi) \|^2 \tag{5.108}$$

$$\qquad\quad = \frac{1}{\pi}B_q(\pi)'[I_q - P(\theta_0)]B_q(\pi) \tag{5.109}$$

Now, using (5.101) and $H'H = I_q$, we have

$$I_q - P(\theta_0) = H'H - H'\Lambda H = H'[I_q - \Lambda]H = \begin{bmatrix} 0 & 0 \\ 0 & H_2'H_2 \end{bmatrix}$$

This result can be combined with (5.108)-(5.109) to deduce that

$$O1_T(\pi) \Rightarrow \frac{1}{\pi}B_{q-p}(\pi)'H_2'H_2B_{q-p}(\pi)$$

where $B_{q-p}(\pi)$ is the vector consisting of the last $q-p$ elements of $B_q(\pi)$. Since
$H_2$ is an orthogonal matrix, we can use the same reasoning as above to deduce
that $H_2B_{q-p}(\pi) \Rightarrow B_{q-p}(\pi)$ and hence that

$$O1_T(\pi) \Rightarrow \frac{1}{\pi}B_{q-p}(\pi)'B_{q-p}(\pi)$$

Similar reasoning yields

$$O2_T(\pi) \Rightarrow \frac{1}{1-\pi}[B_{q-p}(1) - B_{q-p}(\pi)]'[B_{q-p}(1) - B_{q-p}(\pi)]$$

So finally we obtain the following result for $O_T(\pi)$.


Theorem 5.10 Limiting Distribution of $O_T(\pi)$
If Assumptions 3.1-3.5, 3.8-3.9, 5.6-5.9, 5.11 hold then: $O_T(.) \Rightarrow O(.)$ where
$O(.)$ is a continuous time process associating each date $\pi \in [0,1]$ with the scalar

$$O(\pi) = \frac{1}{\pi}B_{q-p}(\pi)'B_{q-p}(\pi) + \frac{1}{1-\pi}[B_{q-p}(1) - B_{q-p}(\pi)]'[B_{q-p}(1) - B_{q-p}(\pi)].$$


Theorems 5.9 and 5.10 give the limiting behaviour of $W_T(\pi)$ and $O_T(\pi)$
for $\pi \in \Pi$. The limiting distribution of the test statistics then follows directly
from the Continuous Mapping Theorem. This theorem states that if $Z_T(.) \Rightarrow Z(.)$
and $h(.)$ is a continuous functional, then $h(Z_T(.)) \Rightarrow h(Z(.))$.56 Since the
Sup-, Av- and Exp- versions of the statistics involve continuous functionals
of $\{W_T(\pi)\}$ or $\{O_T(\pi)\}$, we can use the Continuous Mapping Theorem to deduce
the following corollary to Theorems 5.9 and 5.10.


Corollary 5.1 Limiting Distributions of Structural Stability Tests for
the Unknown Break Point Case

If Assumptions 3.1-3.5, 3.8-3.9, 5.6-5.11 hold then: (i) $SupW_T \Rightarrow Sup_{\pi \in \Pi}W(\pi)$;
(ii) $AvW_T \Rightarrow \int_\Pi W(\pi)\,dJ(\pi)$; (iii) $ExpW_T \Rightarrow \log\{\int_\Pi \exp[0.5W(\pi)]\,dJ(\pi)\}$;
(iv) $SupO_T \Rightarrow Sup_{\pi \in \Pi}O(\pi)$; (v) $AvO_T \Rightarrow \int_\Pi O(\pi)\,dJ(\pi)$;
(vi) $ExpO_T \Rightarrow \log\{\int_\Pi \exp[0.5O(\pi)]\,dJ(\pi)\}$.


These results are presented in the following places: (i) Andrews (1993) [Theorem 3];
(ii)-(iii) Sowell (1996) [Theorem 3];57 (iv)-(vi) Hall and Sen (1999) [Theorem 3.1].
It is the critical points from these distributions with $J(\pi)$ equal to
the uniform distribution on $\Pi$ which are reproduced in Tables 5.5 and 5.6.
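
Critical points of this kind can be approximated by simulating the limiting process directly. The sketch below draws discretized Brownian bridges and evaluates the supremum of $BB_p(\pi)'BB_p(\pi)/[\pi(1-\pi)]$ over a grid on $[0.15, 0.85]$, which is the limit of the Sup version of the parameter variation statistic. It is a rough illustration under stated assumptions, not the procedure used to produce the published tables: the function names, grid size, number of replications and seed are all hypothetical, and a coarse grid tends to understate the supremum slightly.

```python
import math
import random


def simulate_sup_w(p, n_steps=200, n_reps=1000, pi_lower=0.15, pi_upper=0.85, seed=12345):
    """Sorted Monte Carlo draws of sup over the grid of BB_p' BB_p / (pi (1 - pi))."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 / n_steps)
    lo = math.ceil(pi_lower * n_steps - 1e-9)
    hi = math.floor(pi_upper * n_steps + 1e-9)
    draws = []
    for _ in range(n_reps):
        # p-dimensional Brownian motion on the grid k / n_steps
        path = [[0.0] * p]
        for _ in range(n_steps):
            path.append([x + scale * rng.gauss(0.0, 1.0) for x in path[-1]])
        end = path[-1]
        sup_w = 0.0
        for k in range(lo, hi + 1):
            pi = k / n_steps
            # squared norm of the Brownian bridge B(pi) - pi * B(1)
            bb_sq = sum((b - pi * e) ** 2 for b, e in zip(path[k], end))
            sup_w = max(sup_w, bb_sq / (pi * (1.0 - pi)))
        draws.append(sup_w)
    draws.sort()
    return draws


def empirical_quantile(sorted_draws, prob):
    """Simple order-statistic estimate of the prob-quantile."""
    idx = min(len(sorted_draws) - 1, int(prob * len(sorted_draws)))
    return sorted_draws[idx]
```

Calling `empirical_quantile(simulate_sup_w(1), 0.95)` gives a simulated counterpart of the 5% critical point in the $p = 1$ row of Table 5.5; a finer grid and more replications would be needed before comparing digits with the tabulated value.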

One final comment is in order. Recall from Theorem 3.5 that
the parameter estimator (identifying restrictions) and the estimated sample
moment (overidentifying restrictions) are asymptotically independent if the model
is correctly specified. This independence has already manifested itself in various
other inference procedures discussed earlier in this chapter, and it is also present
here. The sequence of statistics $\{W_T(\pi)\}$ are functions of the first $p$ elements
of $B_q(.)$, and the $\{O_T(\pi)\}$ are functions of the last $q-p$ elements. Since, by
definition, the elements of a Brownian motion are mutually independent, it
follows that the tests of parameter variation are asymptotically independent of the
tests based on the overidentifying restrictions under $H_0^{SS}(\Pi)$.


56 For example, see Hamilton (1994) [p.482] and the discussion therein. 
57 Also see Andrews and Ploberger (1994). 
5.4.3 Other Types of Structural Instability


As mentioned in the preamble to this section, the single break point case has 
received by far the most attention within the GMM literature on structural 
stability. However, other types have also been considered and we now provide 
a brief review of these alternatives. 

An obvious extension of the single break point case is to allow for the presence
of multiple break points. To date this approach has not been developed
in the context of GMM estimators. However, Bai and Perron (1998) have developed
methods in the context of linear regression models. One aspect of their
results is particularly interesting. They show that if it is assumed that there is
a single break point then the estimated fraction $\hat{\pi}$ is consistent for the fraction
associated with one of the multiple break points. This enables them to propose
an iterative procedure in which the researcher gradually increases the number 
of break points until the structural stability tests are no longer significant. To 
date, it is unknown whether this type of sequential estimation procedure works 
in the more general GMM framework. 

Hansen (1990) considers tests for $H_0^{SS}([0,1])$ against the alternative
$E[f(v_t, \theta_t)] = 0$ and

$$\theta_t = \theta_{t-1} + \eta_t$$

where $\eta_t \sim$ i.i.d.$(0, \tau^2 H_t)$. Notice that if $\tau^2 = 0$ then this model reduces to
the null hypothesis. Interestingly, Hansen (1990) shows that the LM statistic
against this alternative is well approximated by $AvW_T$ and so this statistic is
likely to have good power properties against this alternative as well.⁵⁸
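To see how the random walk alternative nests the null hypothesis, the parameter path can be simulated directly. The sketch below is illustrative only: it assumes Gaussian increments and a variance matrix $H$ that is constant over time, neither of which is required by the text, and the function name `simulate_theta_path` is hypothetical.

```python
import numpy as np

def simulate_theta_path(theta_init, tau2, H, T, seed=0):
    """Simulate theta_t = theta_{t-1} + eta_t with Var[eta_t] = tau2 * H.

    Gaussian increments and a time-invariant H are simplifying
    assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    theta_init = np.asarray(theta_init, dtype=float)
    chol = np.linalg.cholesky(np.asarray(H, dtype=float))
    # Each row is one increment eta_t' = sqrt(tau2) * z_t' chol'
    eta = np.sqrt(tau2) * rng.standard_normal((T, theta_init.size)) @ chol.T
    return theta_init + np.cumsum(eta, axis=0)

# tau2 = 0: the parameter vector never moves, which is the null hypothesis
stable = simulate_theta_path([0.5, 1.0], tau2=0.0, H=np.eye(2), T=100)
# tau2 > 0: the parameter vector drifts as a random walk
drifting = simulate_theta_path([0.5, 1.0], tau2=0.01, H=np.eye(2), T=100)
```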

More generally, Sowell (1996) provides a framework for the construction of 
optimal tests for parameter variation based on GMM estimators. His results 
provide a generic approach which can be specialized to the form of instability 
of interest. 

Finally, it should be noted that all these procedures rely on asymptotically 
large samples and so are unlikely to have good power properties against insta- 
bility at the very beginning or end of the sample. Dufour, Ghysels, and Hall 
(1994) propose a Generalized Predictive test which can be applied in this situation.
The null and alternative hypotheses are the same as for the Predictive test
except this time only $T_1$ need be asymptotically large and $T_2$ may be as small as
one observation. The statistic is based on $\{f(v_t, \hat{\theta}_{1,T}(\pi)),\ t \in \mathcal{T}_2\}$ and not the
sub-sample average. Since the focus is now the individual observations, it is not
possible to use a conventional asymptotic analysis to deduce the distribution. 
One solution is to make a distributional assumption, but this is unattractive 
in most GMM settings. Therefore Dufour, Ghysels, and Hall (1994) consider 
various distribution free methods of approximating or bounding the p-value of 
their statistics. 


58 Hansen's (1990) analysis is motivated by earlier work due to Nyblom (1989) in the context
of Maximum Likelihood estimators.
5.5 Other Hypothesis Tests


The foregoing tests are by far the most commonly used in the types of applica- 
tion listed in Table 1.1. However, certain other tests have been proposed and in 
this section we provide a brief review of these methods. The discussion covers 
non-nested hypothesis tests (Section 5.5.1), Hausman tests (Section 5.5.2) and 
conditional moment tests (Section 5.5.3). 


5.5.1 Non-Nested Hypothesis Tests 


So far, we have concentrated on methods for testing hypotheses about popu- 
lation moment conditions or parameters within a particular model. However, 
in many cases more than one model has been advanced to explain a particu- 
lar economic phenomenon and so it may become necessary to choose between 
them. Sometimes, one model is nested within the other in the sense that it can 
be obtained by imposing certain parameter restrictions. In this case the choice 
between them amounts to testing whether the data support the restrictions in 
question using the methods described in Section 5.3. Other times, one model is 
not a special case of the other and so they are said to be non-nested. There have 
been two main approaches to developing tests between non-nested models. One 
is based on creating a more general model which nests both candidate models 
as a special case; the other examines whether one model is capable of explaining 
the results in the other. Most of this literature has focused on regression models 
or models estimated by maximum likelihood. While these situations technically 
fall within the GMM framework, they do not possess its distinctive features 
and so are not covered here.⁵⁹ Instead, we focus on methods for discriminating
between two non-nested Euler equation models. These models involve partially 
specified systems and so involve aspects unique to the GMM in its most general 
form. 

We consider the case where there are two competing models denoted M1
and M2. If M1 is true then the parameter vector $\theta_1$ and the data satisfy the
Euler equation

$$E_1[u_1(v_t, \theta_1) \mid \Omega_{t-1}] = 0 \qquad (5.110)$$

where $\Omega_{t-1}$ is the information available at time $t-1$ and $E_1[\cdot]$ denotes expectations
under the assumption that M1 is correct. For our purposes, it is sufficient
to assume the Euler equation residual $u_1(v_t, \theta_1)$ is a scalar. From (5.110) it
follows that the residual is orthogonal to any $(q_1 \times 1)$ vector $z_{1,t} \in \Omega_{t-1}$, and
this yields the population moment condition

$$E_1[z_{1,t}\,u_1(v_t, \theta_1)] = 0 \qquad (5.111)$$

Using analogous definitions, M2 leads to the $(q_2 \times 1)$ population moment condition

$$E_2[z_{2,t}\,u_2(v_t, \theta_2)] = 0 \qquad (5.112)$$


59 These techniques are well described in the recent comprehensive review by Gourieroux 
and Monfort (1994). 
where again the Euler equation residual is taken to be a scalar. It is assumed
that the two models are globally non-nested in the sense that one model is not
a special case of the other.⁶⁰ Since both models can be subjected to the tests
in Sections 5.1-5.4, there can only be a need to discriminate between them if 
both models pass all these diagnostics; so we assume this to be the case. 

As mentioned above there are two main strategies to developing non-nested 
hypothesis tests and each has been applied within the context of Euler equation 
models. Singleton (1985) proposes nesting the Euler equations of M1 and M2 
within the Euler equation of a more general model. Ghysels and Hall (1990b)
propose tests of whether one model can explain the results in another. We now 
describe these in turn. 

Singleton’s (1985) analysis begins with the observation that if M1 is false 
and its overidentifying restrictions test is insignificant then it must be because 
the test has poor power properties when M2 is true. Therefore, he proposes 
choosing the linear combination of the overidentifying restrictions which has the 
most power in the direction of M2. The problem is how to characterize this 
direction. Singleton (1985) solves this issue by introducing a more general Euler 
condition which is the following convex combination of those from M1 and M2, 


$$E_G[e_t(\theta_1, \theta_2, \omega) \mid \Omega_{t-1}] = 0 \qquad (5.113)$$

where

$$e_t(\theta_1, \theta_2, \omega) = \omega\,u_1(v_t, \theta_1) + (1 - \omega)\,u_2(v_t, \theta_2),$$

$0 \leq \omega \leq 1$, and $E_G[\cdot]$ is taken with respect to the true distribution of the
data under this more general model. Notice that $\omega = 1$ implies M1 is correct,
and $\omega = 0$ implies M2 is correct. The other values of $\omega$ imply a continuum
of residual processes which lie between those implied by M1 and M2 in some
sense. If $\omega$ is replaced by a suitably defined sequence $\omega_T$ which converges to one
from below at rate $T^{1/2}$ and $z_{1,t} = z_{2,t} = z_t$, then


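$$E_G[z_t\,e_t(\theta_1, \theta_2, \omega_T)] = 0$$

defines a sequence of local alternatives to (5.111) in the direction of (5.112).
Singleton (1985) shows that the linear combination of the overidentifying restrictions
in M1 which maximizes power against this local alternative is the
transpose of

$$A_T = \hat{S}_{1,T}^{-1}\,\bigl(g_{1,T}(\hat{\theta}_{1,T}) - g_{2,T}(\hat{\theta}_{2,T})\bigr)$$

where $g_{i,T}(\hat{\theta}_{i,T}) = T^{-1}\sum_{t=1}^{T} z_t\,u_i(v_t, \hat{\theta}_{i,T})$, $\hat{S}_{1,T}$ is a consistent estimator of
$\lim_{T \to \infty} Var[T^{1/2} g_{1,T}(\theta_1)]$ and $\hat{\theta}_{i,T}$ is the GMM estimator of $\theta_i$. This leads to
the test statistic

$$NN_T(1,2) = T\,g_{1,T}(\hat{\theta}_{1,T})' A_T\,(A_T' \hat{\Sigma}_{1,T} A_T)^{-1} A_T'\,g_{1,T}(\hat{\theta}_{1,T})$$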


60 See Pesaran (1987) for a formal definition of nested, partially non-nested and globally 
non-nested models. The distinction between the last two can be important but need not 
concern us here. 
where $\hat{\Sigma}_{1,T} = \hat{S}_{1,T} - G_{1,T}(G_{1,T}' \hat{S}_{1,T}^{-1} G_{1,T})^{-1} G_{1,T}'$ and $G_{1,T} = \partial g_{1,T}(\hat{\theta}_{1,T})/\partial\theta'$.
Singleton (1985) shows that if M1 is correct then $NN_T(1,2)$ converges to a $\chi^2_1$
distribution. The roles of M1 and M2 can be reversed to produce the analogous
statistic $NN_T(2,1)$ which would be asymptotically $\chi^2_1$ if M2 is correct. In fact,
the test should be performed both ways and so there are four possible outcomes:
$NN_T(1,2)$ is significant but $NN_T(2,1)$ is not and so M2 is chosen; $NN_T(2,1)$
is significant but $NN_T(1,2)$ is not and so M1 is chosen; both $NN_T(1,2)$ and
$NN_T(2,1)$ are significant and so both models can be rejected; both $NN_T(1,2)$
and $NN_T(2,1)$ are insignificant and so it is not possible to choose between them
in this way.
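The four possible outcomes can be written as a small decision rule. The sketch below takes the two statistics as given and compares each with the 5% critical value of the chi-squared distribution with one degree of freedom (3.841); the function name `choose_model` and the returned labels are hypothetical.

```python
CHI2_1_CRIT_5PCT = 3.841  # 5% critical value of the chi-squared(1) distribution

def choose_model(nn_12, nn_21, crit=CHI2_1_CRIT_5PCT):
    """Map the pair (NN_T(1,2), NN_T(2,1)) to one of the four outcomes."""
    reject_m1 = nn_12 > crit   # NN_T(1,2) is significant: evidence against M1
    reject_m2 = nn_21 > crit   # NN_T(2,1) is significant: evidence against M2
    if reject_m1 and not reject_m2:
        return "M2"
    if reject_m2 and not reject_m1:
        return "M1"
    if reject_m1 and reject_m2:
        return "reject both"
    return "cannot discriminate"
```

For example, `choose_model(5.0, 1.0)` selects M2 because only $NN_T(1,2)$ exceeds the critical value.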

This approach is relatively simple to implement because it does not require 
any additional assumptions or computations beyond those already involved for 
the estimation of M1 and M2. Its weakness is that the convex combination 
of the Euler equations from M1 and M2 may not be the Euler equation of a 
well defined economic model.⁶¹ In such cases, it is unclear how a significant
statistic should be interpreted. The only way to avoid this problem is to con- 
sider sequences of local alternatives to the data generation process implied by 
M1 which are in the direction of the data generation process implied by M2. 
However, this involves making the type of distributional assumption which the 
use of GMM was designed to avoid. 

Ghysels and Hall (1990b) propose an alternative approach to testing based
on whether one model can explain the results in the other.⁶² More specifically,
the data are said to support M1 if

$$T^{-1/2}\sum_{t=1}^{T}\bigl\{z_{2,t}\,u_2(v_t, \hat{\theta}_{2,T}) - E_1[z_{2,t}\,u_2(v_t, \hat{\theta}_{2,T})]\bigr\} \qquad (5.114)$$


is zero allowing for sampling error. To implement the test it is necessary to 
know or be able to estimate the expectation term in (5.114). Unfortunately, 
this typically involves specifying the conditional distribution of $v_t$ and so is
unattractive for the reason mentioned above.⁶³ Ghysels and Hall (1990) develop
a test based on approximating the expectation using quadrature based
methods, but we omit the details here. 

Both these statistics are clearly focusing on the overidentifying restrictions 
alone. It is possible to extend Ghysels and Hall’s (1990b) approach to tests of
whether M1 can explain the identifying restrictions in M2. Such a test would
focus on whether the solution to the identifying restrictions in M2 is equal to
the value predicted by M1. In other words, it would examine

$$\hat{\theta}_{2,T} - E_1[\hat{\theta}_{2,T}]$$


61 For example, Ghysels and Hall (1990b) show that a model constructed by taking a convex 
combination of the data generating processes for $v_t$ implied by M1 and M2 does not typically
possess an Euler equation of the form in (5.113).

62 This general approach is often referred to as the encompassing test principle; see Mizon
and Richard (1986). 

63 Furthermore Ghysels and Hall (1990b) show that a misspecification of this distribution 
can cause their statistic to be significant. 
However, it would suffer from the same drawbacks as mentioned above and so
we do not pursue such a test here. 

Neither of these approaches is really satisfactory. Singleton’s (1985) test is 
only appropriate in the limited setting where (5.113) is the Euler condition of 
a meaningful model. Ghysels and Hall’s (1990b) test is always appropriate but
requires additional assumptions about the distribution, and once these are made,
it is more efficient to use Maximum Likelihood estimation.⁶⁴ This contrasts with
the more successful treatments of the hypotheses in Sections 5.1-5.4. In these 
earlier cases, the partial specification caused no problems, but it clearly does 
so for non-nested hypotheses. In one sense, these results are more important 
because they illustrate the potential limits to inference based on a partially 
specified model. 


5.5.2 Hausman Tests 


Hausman (1978) proposes testing a hypothesis on the basis of a comparison
of two estimators of the parameter vector. One estimator must be consistent
under the null hypothesis but inconsistent under the alternative. The other must
be consistent under both the null and the alternative. The simplest illustration is
one of the examples used by Hausman, and which is also the most common
application of this approach to testing. Suppose we have a linear regression
model and are suspicious that one of the regressors, $x_{i,t}$ say, is endogenous. The
null hypothesis that $x_{i,t}$ is exogenous can be tested via a Hausman test which
compares the OLS estimator with an IV estimator. The former is consistent
only if $x_{i,t}$ is exogenous; the latter is consistent regardless. Clearly the difference
between them converges to zero under the null, but to some non-zero value under
the alternative.⁶⁵
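The OLS versus IV comparison can be illustrated with simulated data. The following is a minimal sketch under strong simplifying assumptions that are not in the text: a single regressor with no intercept, one instrument, homoskedastic errors, and a common error variance estimate taken from the IV residuals. The function name `hausman_ols_iv` and the data generating process are hypothetical.

```python
import numpy as np

def hausman_ols_iv(y, x, z):
    """Hausman statistic comparing OLS and IV in y = beta * x + u.

    x is a single regressor and z a single instrument; both are assumed
    to have mean zero so that no intercept is needed.
    """
    n = y.size
    b_ols = (x @ y) / (x @ x)
    b_iv = (z @ y) / (z @ x)
    # Error variance from the IV residuals, which use a consistent
    # estimator whether or not x is exogenous
    s2 = np.sum((y - b_iv * x) ** 2) / n
    var_ols = s2 / (x @ x)
    var_iv = s2 * (z @ z) / (z @ x) ** 2
    # var_iv >= var_ols by the Cauchy-Schwarz inequality
    h = (b_iv - b_ols) ** 2 / (var_iv - var_ols)
    return b_ols, b_iv, h

rng = np.random.default_rng(0)
n = 2000
z = rng.standard_normal(n)
v = rng.standard_normal(n)
e = rng.standard_normal(n)
x = z + v                  # regressor driven by the instrument and by v
u_endog = 0.8 * v + e      # error correlated with x through v
y_endog = 1.0 * x + u_endog
b_ols, b_iv, h_endog = hausman_ols_iv(y_endog, x, z)
```

In this design the regressor is endogenous, so the OLS estimate drifts away from the true value of one while the IV estimate does not, and the statistic would be compared with a chi-squared critical value with one degree of freedom.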

It is readily recognized that this basic principle can be applied in a wide 
variety of settings. It is often applied in the context of Maximum Likelihood 
estimation to test if the specification is correct. To present this version of the 
statistic, let $\tilde{\theta}_T$ denote the MLE and $\hat{\theta}_T$ be a GMM estimator of $\theta_0$ based on
some population moment condition $E[f(v_t, \theta_0)] = 0$. The Hausman test statistic
is then given by

$$H_T = T\,(\hat{\theta}_T - \tilde{\theta}_T)'\,(\hat{V}_T - \tilde{V}_T)^{-1}\,(\hat{\theta}_T - \tilde{\theta}_T)$$

where $\hat{V}_T$ and $\tilde{V}_T$ are consistent estimators of the asymptotic covariance of $\hat{\theta}_T$
and $\tilde{\theta}_T$ respectively. Under the joint null hypotheses that the Maximum Likelihood
estimation is based on the correct model and $E[f(v_t, \theta_0)] = 0$ is valid


64 Although, full information maximum likelihood may be more computationally burdensome;
see Ghysels and Hall (1990b).

65 This statistic is often referred to as the Wu-Hausman test because, to quote Nakamura
and Nakamura (1998), “it was Hausman [(1978)] who presented it in the form that led to its
widespread use but Wu [(1973)] who presented it first.” [p.220]. See Nakamura and Nakamura
(1998) for further discussion of the literature on these types of endogeneity test.
then Hausman (1978) shows that $H_T$ converges to a $\chi^2_p$ distribution. The alternative
hypothesis is that the model is not correctly specified but nevertheless
$E[f(v_t, \theta_0)] = 0$.

Newey (1985a) extends this test principle to models estimated by GMM. 
Newey (1985a) derives a Hausman statistic based on the difference between 
GMM estimators obtained from two sets of moment conditions which may con- 
tain elements in common. Interestingly, he shows that if one estimator is ob- 
tained with the optimal weighting matrix then the asymptotic variance of the 
difference of the estimators has the same difference structure as in the Maxi- 
mum Likelihood case. However, this asymptotic variance may also be singular. 
In principle, this matter is easily fixed by using a generalized inverse in the 
construction of the quadratic form, and then comparing the statistic to the 
critical point from a $\chi^2$ distribution with degrees of freedom equal to the rank
of $\hat{V}_T - \tilde{V}_T$. However, in practice, this adjustment is not so straightforward
for two reasons. First, rank$\{\mathrm{plim}_{T \to \infty}(\hat{V}_T - \tilde{V}_T)\}$ may be difficult to deduce
a priori. Secondly, and unlike inverses, generalized inverses are not necessarily
continuous functions of the elements, and so additional conditions are needed
to ensure that the generalized inverse of $\hat{V}_T - \tilde{V}_T$ converges in probability to
the generalized inverse of $\mathrm{plim}_{T \to \infty}(\hat{V}_T - \tilde{V}_T)$; see Andrews (1987). Both these
problems may explain the infrequent use of this test in the types of applications 
listed in Table 1.1. 
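The generalized inverse adjustment can be sketched as follows. The example uses the Moore-Penrose inverse, which is one particular choice of generalized inverse, and takes the numerical rank of the estimated variance difference as the degrees of freedom. As the discussion above warns, this numerical rank need not equal the rank of the probability limit, so the sketch illustrates the mechanics only; the function name `hausman_ginv` and the inputs are hypothetical.

```python
import numpy as np

def hausman_ginv(theta_a, theta_b, V_a, V_b, T):
    """Hausman statistic using the Moore-Penrose inverse of V_a - V_b.

    Returns the statistic together with the numerical rank of the
    estimated variance difference, used here as the degrees of freedom.
    """
    d = np.asarray(theta_a, dtype=float) - np.asarray(theta_b, dtype=float)
    V_diff = np.asarray(V_a, dtype=float) - np.asarray(V_b, dtype=float)
    stat = T * d @ np.linalg.pinv(V_diff) @ d
    df = int(np.linalg.matrix_rank(V_diff))
    return stat, df

# Hypothetical inputs in which the variance difference is singular (rank one)
stat, df = hausman_ginv(
    theta_a=[1.2, 0.5],
    theta_b=[1.0, 0.5],
    V_a=[[3.0, 0.0], [0.0, 1.0]],
    V_b=[[1.0, 0.0], [0.0, 1.0]],
    T=10,
)
```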


5.5.3 Conditional Moment Tests 


All the statistics presented in Sections 5.1, 5.2 and 5.4 test hypotheses about
the population moment conditions upon which estimation is based. This mirrors
the majority of empirical applications mentioned in the introduction. In these
types of application, the model is only partially specified and so it is desirable
to base estimation on as much relevant information as possible.⁶⁶ Therefore
all available moment conditions tend to be used in estimation.⁶⁷ However, if
the distribution of the data is known then the most asymptotically efficient 
estimates are obtained by using Maximum Likelihood. As shown in Section 
3.7.1, maximum likelihood amounts to GMM estimation based on the score 
function of the data. So, in this case there is no advantage to including any 
other moment conditions implied by the model. These other moment conditions 
can, however, be used to test whether the specification of the model is correct. 
This generic approach yields what have become known as conditional moment 
tests. 

Newey (1985) and Tauchen (1985a) independently introduce a general frame- 
work for conditional moment testing based on Maximum Likelihood estimators. 
To illustrate this framework, suppose that the conditional probability density 


66 This statement is formally justified in Chapter 6. 
67 The choice of moment conditions may be limited by other factors such as data availability
or computational constraints. 
of $v_t$ given $\{v_{t-1}, v_{t-2}, \ldots, v_1\}$ is $p_t(v_t; \theta_0)$, and so the score function satisfies

$$E[L_t(\theta_0)] = 0$$

where $L_t(\theta_0) = \partial \log(p_t(v_t; \theta_0))/\partial\theta$. As mentioned above this is the moment
condition upon which estimation is based. Now assume that if this model is
correctly specified then the data also satisfy the $(q \times 1)$ population moment
condition $E[g(v_t, \theta_0)] = 0$. Therefore one way to assess the validity of the model
is to test

$$H_0: E[g(v_t, \theta_0)] = 0$$

against the alternative

$$H_A: E[g(v_t, \theta_0)] \neq 0$$

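This hypothesis can be tested using the statistic

$$CM_T = T^{-1}\sum_{t=1}^{T} g(v_t, \hat{\theta}_T)'\;\hat{Q}_T^{-1}\sum_{t=1}^{T} g(v_t, \hat{\theta}_T) \qquad (5.115)$$

where $\hat{\theta}_T$ is the maximum likelihood estimator, $\hat{Q}_T$ is a consistent estimator of
$\lim_{T \to \infty} Var[T^{-1/2}\sum_{t=1}^{T} c_t(\theta_0)]$, and

$$c_t(\theta_0) = g_t(\theta_0) - E[\partial g_t(\theta_0)/\partial\theta']\,\{E[\partial L_t(\theta_0)/\partial\theta']\}^{-1} L_t(\theta_0)$$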


Under $H_0$, $CM_T$ converges to a $\chi^2_q$ distribution. The statistic has a similar structure
to the overidentifying restrictions test but there is an important difference.
Since $E[g(v_t, \theta_0)] = 0$ is not used in estimation, the statistic has power against
any violations of $H_0$; see Newey (1985b). In spite of this, some caution is needed
in the interpretation of the results. While a rejection of Ho implies the model 
is misspecified, a failure to reject only implies that the assumed distribution 
exhibits this particular characteristic of the true distribution. 

The choice of g(.) varies from model to model. For example, in the normal 
linear regression model, g(.) often involves the third and fourth moments of the 
error process; see Bowman and Shenton (1975). White (1982) suggests that one 
generally applicable choice is to base g(.) on the information matrix identity, 


$$E[L_t(\theta_0) L_t(\theta_0)'] = -E[\partial L_t(\theta_0)/\partial\theta']$$


because if the null hypothesis cannot be rejected then conventional formulae
for Wald, LR and LM statistics are valid. Consequently, this approach has been 
explored in many settings; for example, see Chesher (1984) and Hall (1987a). 
Various other examples are provided by Newey (1985b) and Tauchen (1985a). 
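As an illustration of the conditional moment statistic, consider testing an i.i.d. normal model using the third and fourth moments of the standardized data. The sketch below is one possible implementation under assumptions that go beyond the text: the observations are i.i.d., so no serial correlation correction is applied to the variance estimator, and the expectations in the definition of $c_t(\theta_0)$ are replaced by sample averages evaluated at the maximum likelihood estimates. The function name `cm_normality` is hypothetical.

```python
import numpy as np

def cm_normality(v):
    """Conditional moment statistic for the N(mu, sigma^2) model.

    Uses g_t = (z_t^3, z_t^4 - 3) with z_t the standardized observation,
    and theta = (mu, sigma^2) estimated by maximum likelihood.
    """
    v = np.asarray(v, dtype=float)
    T = v.size
    e = v - v.mean()                   # deviations from the ML estimate of mu
    s2 = np.mean(e ** 2)               # ML estimate of sigma^2
    s = np.sqrt(s2)

    # Score L_t, one row per observation
    L = np.column_stack([e / s2, (e ** 2 - s2) / (2 * s2 ** 2)])
    # Moment functions that are not used in estimation
    g = np.column_stack([(e / s) ** 3, (e / s) ** 4 - 3.0])

    m3, m4 = np.mean(e ** 3), np.mean(e ** 4)
    # Sample analogue of E[dg/dtheta']
    Dg = np.array([[-3.0 / s, -1.5 * m3 / s ** 5],
                   [-4.0 * m3 / s ** 4, -2.0 * m4 / s ** 6]])
    # Sample analogue of E[dL/dtheta'] at the ML estimates
    DL = np.array([[-1.0 / s2, 0.0],
                   [0.0, -0.5 / s2 ** 2]])

    # c_t = g_t - Dg DL^{-1} L_t, stacked by rows
    c = g - L @ (Dg @ np.linalg.inv(DL)).T
    Q = c.T @ c / T                    # variance estimator for the i.i.d. case
    g_sum = g.sum(axis=0)
    return g_sum @ np.linalg.solve(Q, g_sum) / T

rng = np.random.default_rng(0)
cm_normal = cm_normality(rng.standard_normal(5000))   # model correctly specified
cm_skewed = cm_normality(rng.exponential(size=5000))  # model misspecified
```

With $q = 2$ the statistic would be compared with a chi-squared critical value with two degrees of freedom (5.991 at the 5% level). In this design the adjusted moment functions are, up to estimation error, the third and fourth Hermite polynomials in the standardized observation.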


5.6 Summary 


This chapter has presented a number of inference procedures that can be used 
to learn about the underlying model. The discussion focused on four main types 
of hypothesis test within the GMM framework: the overidentifying restrictions 
test; an overidentifying restrictions based test for the validity of a subset of 
the population moment condition; Wald, D and LM tests for testing whether 
the parameter vector satisfies a set of nonlinear restrictions; structural stability
tests based on both the identifying restrictions and also the overidentifying 
restrictions. The limiting behaviour of these test statistics is derived under 
the appropriate null hypothesis. The power properties are analyzed using either 
local or non-local alternatives, and these two approaches are contrasted. A brief 
review is also provided of other less common hypothesis tests within the GMM 
framework such as non-nested tests, Hausman tests and conditional moment 
tests. 

In the preamble to this chapter, it is observed that three types of inference 
questions arise in practice. These are — Is the model correctly specified? Does 
the model satisfy restrictions implied by economic/statistical theory? Which 
of two competing models is correct? We now briefly summarize what has been 
learnt about how these questions can be addressed. 


• Is the model correctly specified? Misspecification can take two basic forms.
First, the model can be misspecified in the sense described in Chapter 4,
that is, $E[f(v_t, \theta)]$ is the same for all $t$ but there is no value of $\theta$ that makes
this expectation zero. Secondly, the model can be structurally unstable
so that $E[f(v_t, \theta_0)] = 0$ for some part of the sample but not for all of it.
The overidentifying restrictions test is designed to test against the first 
of these types of misspecification, and is consistent against this type of 
alternative. The overidentifying restrictions test has power against certain 
types of misspecification due to structural instability but is not consistent 
against all forms of structural instability. This type of misspecification 
can be detected using specially designed structural stability tests. The 
latter tests can be based on either the identifying restrictions, in which 
case they amount to tests for parameter variation, or the overidentifying 
restrictions. 


• Does the model satisfy restrictions implied by economic/statistical theory?
In many cases of interest, the restrictions implied by economic theory take
the form of a set of nonlinear restrictions on the parameter vector. Such
restrictions can be tested using Wald, D or LM tests. 


• Which of two competing models is correct? Assuming that both models
appear correctly specified on the basis of the diagnostics described above, the
answer then depends on the relationship between the two models. If they
are nested, in the sense that one is obtained by imposing a set of parameter
restrictions on the other, then the choice between them can be based on
the Wald, D or LM statistics for testing the validity of the restrictions
in question. However, if the models are non-nested then this becomes a
far harder question to address within the types of model in Table 1.1
without the specification of the probability distribution of the data. 


All the inference procedures described above are based on asymptotic the- 
ory. However, as noted at the outset, asymptotic theory is only used as an 
approximation to large sample behaviour. It is therefore important to investigate
how good this asymptotic approximation is to finite sample behaviour in
the kinds of circumstance encountered in practice. This topic is addressed in 
the next chapter. 


6 


Asymptotic Theory and 
Finite Sample Behaviour 


So far, all the analysis has rested on asymptotic theory. This approach has 
been taken for two good reasons. First, to date, it has proved impossible to 
develop a finite sample distribution theory for GMM estimators in nonlinear 
dynamic models. Secondly, as we have seen, asymptotic analysis delivers a 
very powerful inference framework. However, there is inevitably a price to be 
paid. All the asymptotic results are only strictly valid in the limit as $T \to \infty$,
and so represent an approximation to finite sample behaviour. The question 
to which we now turn is: how good is this approximation? Intuition suggests 
that the answer varies from case to case, and so one goal of this chapter is to 
identify what aspects of the specification determine the quality of the asymptotic 
approximation. 

Since finite sample distribution theory is intractable for nonlinear dynamic 
models, this question has been addressed in this context via computer based 
simulation studies calibrated to match models of particular interest. These 
studies form the main focus of this chapter and are reviewed in Section 6.3. 
However, we precede our review of these simulation studies with a discussion of 
two relevant aspects of the theoretical literature. 

First, since many of the simulation studies examine the consequences of in- 
creasing the number of moment conditions, it is useful to consider what can be 
learnt about these consequences from asymptotic analysis. Section 6.1.1 consid- 
ers the case in which there is a finite increase in the degree of overidentification. 
It emerges from this analysis that such an increase can never have a detrimental 
effect on the asymptotic distribution of the estimator. However, there are some 
circumstances in which there is no effect, and so the additional moment condi- 
tions are said to be redundant. This scenario turns out to be pertinent to our 
discussion of the aforementioned simulation studies, and so a formal definition 
of redundancy is provided in Section 6.1.2. Given the potential asymptotic ben- 
efits from increasing the degree of overidentification, it is natural to consider an 
estimation strategy in which this degree is allowed to increase with the sample
size. In Section 6.1.3, it is shown that there are potential gains from such a 
strategy but these can only be reaped if the degree of overidentification does 
not increase too quickly with the sample size. 

The second relevant aspect is the theoretical literature on the finite sample 
behaviour of the GMM estimator in static models. While it is true that finite 
sample distribution theory has proved intractable for nonlinear dynamic mod- 
els to date, this is not the case for the IV estimator in the linear regression 
model discussed in Chapter 2. Although the exact finite sample distribution is 
not easily interpreted, its form does reveal the aspects of the specification upon 
which it depends, and these are summarized in Section 6.2.1. Further insights 
are gained by considering higher order approximations such as Edgeworth ex- 
pansions for the distribution of the estimator or so called Nagar expansions for 
the finite sample bias and mean squared error of the GMM estimator. Both 
methods have been applied in the context of the linear simultaneous equations 
model, but the second has recently been employed very fruitfully to examine the 
bias of GMM estimators in nonlinear static models. Section 6.2.2 summarizes 
the main insights gained from both these analyses. Although these results only 
apply to static models, intuition suggests that if a factor of the specification 
affects the quality of the asymptotic approximation in static models then the
analogous factor has a corresponding effect in dynamic models. At the same 
time, it would be anticipated that the presence of dynamics introduces addi- 
tional complications. 

As mentioned above, Section 6.3 reviews the insights gained from a number of 
simulation studies calibrated to the types of models underlying the applications
in Table 1.1. Finally, Section 6.4 pulls together the evidence from the preceding 
three sections to provide an overview of what factors appear to affect the quality 
of the asymptotic approximation. These factors are also used to motivate the 
topics addressed in the following two chapters. 


6.1 The Impact of the Degree of 
Overidentification on the Asymptotic 
Behaviour of the Estimator 


Theorems 3.1 and 3.2 establish the consistency and asymptotic normality of $\hat{\theta}_T$.
Inspection of these results reveals that they hold for any population moment 
condition satisfying certain regularity conditions of which the most important, 
for our purposes here, are the orthogonality condition in Assumption 3.3 and the 
identification condition in Assumption 3.4. In most cases, the underlying model 
implies multiple choices of $f(v_t, \theta_0)$ which satisfy these conditions. Therefore,
it is important to consider how these asymptotic properties are affected by the 
expansion of the set of population moment conditions upon which estimation is 
based. We split the analysis into three parts. Section 6.1.1 considers the case 
in which there is a finite increase in the number of moment conditions, Section 
6.1.2 introduces the concept of redundant moment conditions and Section 6.1.3
considers the case in which the number of moments increases with T. 


6.1.1 Finite Increase in the Degree of Overidentification 


To facilitate the analysis, it is necessary to introduce the following notation.
We partition $f(\cdot)$ into $f(v_t, \theta)' = [f_1(v_t, \theta)', f_2(v_t, \theta)']$ where $f_i(\cdot)$ is $(q_i \times 1)$ and
$q = q_1 + q_2$ is finite. Now let $\hat{\theta}_{1,T}$ be the (optimal) two step GMM estimator
based on

$$E[f_1(v_t, \theta_0)] = 0 \qquad (6.1)$$

It is assumed that $\theta_0$ is identified by (6.1) and hence that $q_1 \geq p$. Finally, let
$\hat{\theta}_T$ denote the (optimal) two step estimator based on

$$E[f(v_t, \theta_0)] = 0 \qquad (6.2)$$


It is straightforward to invoke Theorems 3.1 and 3.2 in order to deduce that
both estimators are consistent for $\theta_0$ and

$$T^{1/2}(\hat{\theta}_{1,T} - \theta_0) \xrightarrow{d} N(0, V_1) \qquad (6.3)$$
$$T^{1/2}(\hat{\theta}_T - \theta_0) \xrightarrow{d} N(0, V) \qquad (6.4)$$

where $V = (G_0' S^{-1} G_0)^{-1}$, $V_1 = (G_{1,0}' S_{1,1}^{-1} G_{1,0})^{-1}$, $S$ and $G_0$ are defined as
before,¹ $S_{1,1}$ is the $(q_1 \times q_1)$ upper left hand block of $S$, and $G_{1,0}$ is the $(q_1 \times p)$
matrix comprising the first $q_1$ rows of $G_0$. Therefore the only difference between
the two limiting distributions lies in their variance. The following theorem
establishes the relationship between $V$ and $V_1$.


Theorem 6.1 Asymptotic Efficiency and the Inclusion of Additional 
Population Moment Conditions 

If (i) Assumptions 3.1-3.5 and 3.17-3.13 hold; (ii) rank$(G_{1,0}) = p$; (iii) $\hat{\theta}_{1,T}$ is
the (optimal) two step GMM estimator based on (6.1); (iv) $\hat{\theta}_T$ is the (optimal)
two step GMM estimator based on (6.2); then $V_1 - V$ is positive semi-definite
and so $\hat{\theta}_T$ is asymptotically at least as efficient as $\hat{\theta}_{1,T}$.


The regularity conditions are needed to ensure that (6.3)-(6.4) hold. The proof 
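rests purely on showing that $V_1 - V$ is positive semi-definite.

Proof:

Since $V$ and $V_1$ are positive definite, $V_1 - V$ is positive semi-definite if $V^{-1} - V_1^{-1}$
is also positive semi-definite.² The latter difference is more convenient to work
with, and is our focus here. To this end, partition $G_0$ and $S$ into

$$G_0 = \begin{bmatrix} G_{1,0} \\ G_{2,0} \end{bmatrix}, \qquad S = \begin{bmatrix} S_{1,1} & S_{1,2} \\ S_{2,1} & S_{2,2} \end{bmatrix}$$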


1 See Section 3.4.2. 

2 See Dhrymes (1984) [Proposition 65, p.76]. Strictly, Dhrymes only establishes the result 
for the case in which the difference is positive definite, but his proof is easily amended to
cover the case in which the difference is positive semi-definite. 
Using the partitioned matrix inversion formula,³ it follows that

$$S^{-1} = \begin{bmatrix} S_{1,1}^{-1}(I_{q_1} + S_{1,2} A S_{2,1} S_{1,1}^{-1}) & -S_{1,1}^{-1} S_{1,2} A \\ -A S_{2,1} S_{1,1}^{-1} & A \end{bmatrix} \qquad (6.5)$$

where $A = (S_{2,2} - S_{2,1} S_{1,1}^{-1} S_{1,2})^{-1}$. The substitution of both (6.5) and the
partition of $G_0$ into $D = V^{-1} - V_1^{-1}$ yields

$$\begin{aligned}
D = {} & G_{1,0}' S_{1,1}^{-1}(I_{q_1} + S_{1,2} A S_{2,1} S_{1,1}^{-1}) G_{1,0} - G_{2,0}' A S_{2,1} S_{1,1}^{-1} G_{1,0} \\
& - G_{1,0}' S_{1,1}^{-1} S_{1,2} A G_{2,0} + G_{2,0}' A G_{2,0} - G_{1,0}' S_{1,1}^{-1} G_{1,0}
\end{aligned}$$

Multiplying out this expression, it can be verified that $D = B'AB$ where
$B = S_{2,1} S_{1,1}^{-1} G_{1,0} - G_{2,0}$. Now $S$ is positive definite (p.d.) by assumption, and so $S^{-1}$
shares this property, which in turn implies $A$ is p.d. Therefore, $D$ is positive
semi-definite by construction, and so we have established the desired result.
□
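The algebra in the proof can be checked numerically. The sketch below draws an arbitrary positive definite $S$ and a full column rank $G_0$ (both hypothetical, generated at random), then confirms that $D = V^{-1} - V_1^{-1}$ coincides with $B'AB$ and that $V_1 - V$ has no negative eigenvalues beyond rounding error.

```python
import numpy as np

rng = np.random.default_rng(1)
q1, q2, p = 3, 2, 2
q = q1 + q2

# Hypothetical positive definite S and a (q x p) matrix G0
M = rng.standard_normal((q, q))
S = M @ M.T + q * np.eye(q)
G0 = rng.standard_normal((q, p))

# Partition G0 and S conformably with (f1, f2)
G1, G2 = G0[:q1], G0[q1:]
S11, S12 = S[:q1, :q1], S[:q1, q1:]
S21, S22 = S[q1:, :q1], S[q1:, q1:]

V_inv = G0.T @ np.linalg.inv(S) @ G0
V1_inv = G1.T @ np.linalg.inv(S11) @ G1
V, V1 = np.linalg.inv(V_inv), np.linalg.inv(V1_inv)

A = np.linalg.inv(S22 - S21 @ np.linalg.inv(S11) @ S12)
B = S21 @ np.linalg.inv(S11) @ G1 - G2
D = V_inv - V1_inv

# Smallest eigenvalue of V1 - V; non-negative up to rounding error
min_eig = np.linalg.eigvalsh(V1 - V).min()
```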

This result makes intuitive sense. The elements of the population moment 
condition can be viewed as pieces of information about $\theta_0$ and, from this perspective,
Theorem 6.1 can be paraphrased as saying that more correct information
never hurts. For the purposes of our later discussion, it is useful to examine the
circumstances under which it does not help either. This is the topic of the next 
sub-section. 


6.1.2 Redundant Moment Conditions 


Breusch, Qian, Schmidt, and Wyhowski (1999) use the term redundancy to 
describe the situation in which the augmentation of the population moment 
condition has no effect on the asymptotic variance of the estimator. This idea 
can be expressed formally as follows. 


Definition 6.1 Redundant Moment Condition 

Let $V$ denote the asymptotic variance of the GMM estimator based on
$E[f_1(v_t,\theta_0)] = 0$, $E[f_2(v_t,\theta_0)] = 0$, and let $V_1$ be the corresponding variance when
estimation is based on $E[f_1(v_t,\theta_0)] = 0$ alone. The population moment condition
$E[f_2(v_t,\theta_0)] = 0$ is said to be redundant for $\theta_0$ given $E[f_1(v_t,\theta_0)] = 0$ if $V_1 = V$.


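Intuition suggests that $E[f_2(v_t,\theta_0)] = 0$ is redundant given $E[f_1(v_t,\theta_0)] = 0$ if
it provides no information about $\theta_0$ beyond that already in $E[f_1(v_t,\theta_0)] = 0$. To
formalize this intuition, it is necessary to first characterize the part of $f_2(v_t,\theta_0)$
which cannot be explained by $f_1(v_t,\theta_0)$. To this end, it is assumed that the
Central Limit Theorem can be applied to deduce that

$$\begin{bmatrix} T^{1/2} g_{1,T}(\theta_0) \\ T^{1/2} g_{2,T}(\theta_0) \end{bmatrix} \stackrel{d}{\to} N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} S_{1,1} & S_{1,2} \\ S_{2,1} & S_{2,2} \end{bmatrix} \right) \qquad (6.6)$$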


3 See Magnus and Neudecker (1991) [p.11].

where $T^{1/2} g_{i,T}(\theta_0) = T^{-1/2}\sum_{t=1}^{T} f_i(v_t,\theta_0)$ for $i = 1,2$. It follows from (6.6)
that the conditional distribution of $T^{1/2} g_{2,T}(\theta_0)$ given $T^{1/2} g_{1,T}(\theta_0)$ is given by

$$N\left( S_{2,1}S_{1,1}^{-1}\,T^{1/2} g_{1,T}(\theta_0),\; S_{2,2} - S_{2,1}S_{1,1}^{-1}S_{1,2} \right)$$

Given the form of this conditional distribution, the unexplained part of
$T^{1/2} g_{2,T}(\theta_0)$ is given by

$$T^{1/2} g_{2,T}(\theta_0) - S_{2,1}S_{1,1}^{-1}\,T^{1/2} g_{1,T}(\theta_0) = T^{-1/2}\sum_{t=1}^{T} r(v_t,\theta_0), \text{ say,}$$

where $r(v_t,\theta_0) = f_2(v_t,\theta_0) - S_{2,1}S_{1,1}^{-1}f_1(v_t,\theta_0)$. Therefore, we now focus on the
residual, $r(v_t,\theta_0)$. At this stage, it is useful to recall three aspects of our
discussion of local identification in Section 3.1. First, the local information about $\theta_0$
contained in a moment condition is captured by the expectation of its derivative
with respect to $\theta$. Secondly, the moment condition uniquely determines $\theta_0$ if
this expected derivative is full rank. Thirdly, if this expected derivative is rank
deficient then the moment condition provides some information about $\theta_0$ but
not enough to determine it uniquely. Taken together these three points imply
that the moment condition is only completely uninformative if the expected
derivative is zero. Therefore, $E[f_2(v_t,\theta_0)] = 0$ provides no local information
about $\theta_0$ beyond that in $E[f_1(v_t,\theta_0)] = 0$ if and only if

$$E[\partial r(v_t,\theta_0)/\partial\theta'] = 0$$


This condition is one of three for redundancy provided by Breusch, Qian, Schmidt, 
and Wyhowski (1999). The other two are less intuitive but may be easier to 
verify in practice. For completeness, we reproduce all three here, but omit the
proof.⁴


Lemma 6.1 Conditions for Redundancy 

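The following statements are equivalent. (A): $E[f_2(v_t,\theta_0)] = 0$ is redundant
given $E[f_1(v_t,\theta_0)] = 0$. (B): $E[\partial r(v_t,\theta_0)/\partial\theta'] = 0$. (C): $E[\partial f_2(v_t,\theta_0)/\partial\theta'] =
S_{2,1}S_{1,1}^{-1}E[\partial f_1(v_t,\theta_0)/\partial\theta']$. (D): There exists a $(q_1 \times p)$ matrix $A$ such that
$E[\partial f_1(v_t,\theta_0)/\partial\theta'] = S_{1,1}A$ and $E[\partial f_2(v_t,\theta_0)/\partial\theta'] = S_{2,1}A$.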
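
Conditions (B) and (C) are easy to illustrate in a scalar setting. The following is a minimal sketch, assuming $q_1 = q_2 = p = 1$, the inverse variances $V^{-1} = G_0'S^{-1}G_0$ and $V_1^{-1} = G_{1,0}'S_{1,1}^{-1}G_{1,0}$, and hypothetical numerical values held as exact fractions so that $V_1 = V$ can be tested exactly. The function names are illustrative.

```python
from fractions import Fraction


def variances(g1, g2, s11, s12, s22):
    """Return (V1, V) for scalar moment blocks (q1 = q2 = p = 1)."""
    det = s11 * s22 - s12 ** 2
    v_inv = (s22 * g1 ** 2 - 2 * s12 * g1 * g2 + s11 * g2 ** 2) / det
    return s11 / g1 ** 2, 1 / v_inv


def residual_derivative(g1, g2, s11, s12):
    """E[dr/dtheta'] = G_{2,0} - S_{2,1} S_{1,1}^{-1} G_{1,0}, as in condition (B)."""
    return g2 - s12 / s11 * g1


# Hypothetical values (not taken from the text)
s11, s12, s22 = Fraction(2), Fraction(3, 5), Fraction(3, 2)
g1 = Fraction(1)

g2_redundant = s12 / s11 * g1      # chosen so that condition (C) holds
g2_informative = Fraction(1, 2)    # violates condition (C)

for g2 in (g2_redundant, g2_informative):
    v1, v = variances(g1, g2, s11, s12, s22)
    print(residual_derivative(g1, g2, s11, s12), v1 == v)
# prints "0 True" and then "1/5 False"
```

In the first case the expected derivative of the residual is zero and the second moment condition leaves the variance unchanged; in the second case it is non-zero and $V < V_1$.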


The concept of redundancy proves useful in understanding some of the sim- 
ulation results described in Section 6.3. 


6.1.3. The Degree of Overidentification Increases with the 
Sample Size 


If we follow Theorem 6.1 to its logical conclusion, then it leads us to an esti- 
mation strategy in which we include as many population moment conditions as 
possible. For a given sample, q must be less than T. However, as T increases, 


4 Also see Section 7.1.

Theorem 6.1 appears to suggest that it may be advantageous to allow $q$ to
increase as well; in other words, to adopt a strategy in which the number of
population moment conditions is $q_T$ and $q_T \to \infty$ with $T$. In spite of its
intuitive appeal, this logical step must be taken with caution because Theorem 6.1
is premised on Theorem 3.2, and the latter only holds for fixed $q$. To date, there
have been only a few studies which shed light on the asymptotic behaviour of
the estimator when $p$ is fixed but $q_T \to \infty$ with $T \to \infty$. This evidence suggests
that the asymptotic theory derived in Chapter 3 may be valid if $q_T - p$ increases
fairly slowly, but is unlikely to be so if $q_T - p$ increases too rapidly. It should be
noted, however, that all these studies consider the issue in the context of i.i.d.
data. It is left to future research to consider whether these rates of increase for
$q_T$ translate to dependent data. We now briefly summarize the main results on
this issue.


Newey (1990) examines the limiting behaviour of the IV estimator in the
context of a nonlinear simultaneous equations model under the assumption that the
error, $u_t(\theta_0)$, is conditionally homoscedastic given $z_t$. This restriction is
important because it implies that the optimal weighting matrix is the one given in (2.29),
and so is proportional to $(T^{-1}Z'Z)^{-1}$, the choice assumed in Newey's (1990) analysis.⁵
He shows that Theorem 3.2 continues to hold provided $q_T = o(T^{1/2})$. Koenker
and Machado (1999) consider only linear models but allow for the possibility
that $u_t(\theta_0)$ may be conditionally heteroscedastic and so $S$ is estimated by $\hat{S}_{SU}$
in (3.40).⁶ They show that $q_T \to \infty$ and $q_T = o(T^{1/3})$ are sufficient conditions
for Theorem 3.2. This rate is rather slow, and so implies a more limited scope
for an estimation strategy based on an expanding set of moment conditions.
Interestingly, this slow rate appears to stem directly from the behaviour of $\hat{S}_{SU}$.
However, this rate is sufficient and not necessary, and as such is a lower bound
on the possible rate of increase for $q_T$.


If $q_T$ increases faster than the rates given above then this impacts on the
limiting behaviour of the estimator in some way. Morimune (1983) considers
the limiting behaviour of the 2SLS (IV) estimator in the context of the linear
simultaneous equation model. He shows that if $q_T$ increases at rate $T^{1/2}$ then
the estimator is consistent and $T^{1/2}(\hat\theta_T - \theta_0)$ has a limiting normal distribution,
but the mean of this distribution is not zero. He further shows that if $q_T$
increases at rate $T$ then the estimator is inconsistent. Bekker (1994) derives the
limiting behaviour of the IV estimator in the case where the equation of interest
is linear in the parameters and $q_T$ increases with $T$. Using $\theta^*$ to denote the
probability limit of $\hat\theta_T$, Bekker (1994) shows that $(T - p)^{1/2}(\hat\theta_T - \theta^*)$ converges
to a normal distribution with mean zero but a different variance than the one in
Theorem 3.2.


5 The main focus of Newey’s (1990) study is actually the construction of optimal instru- 
ments, a topic that is considered in Chapter 7. 

6 Note that the dimension of $\hat{S}_T$ is $q_T$ and so increases with $T$. This case is outside the
settings reviewed in Section 3.5 for which $q_T = q$.


6.2 Finite Sample Theory for Static Models


This section describes the insights gained from two theoretical frameworks for 
learning about the finite sample behaviour of GMM estimators in static mod- 
els. Section 6.2.1 describes the available exact finite sample results for GMM 
estimators. Section 6.2.2 summarizes results derived using higher order approx- 
imations based on Edgeworth and Nagar expansions. 


6.2.1 Exact Results for the IV Estimator in the Linear 
Simultaneous Equations Models 


There has been a considerable literature on the finite sample distributions of 
estimators in the static linear simultaneous equations model.⁷ Here we focus
exclusively on the results for the IV estimator described in Chapter 2. 

For our purposes in Chapter 2, it suffices to specify just the equation of in- 
terest and make certain broad assumptions about the interrelationship between 
the variables. Here it is necessary to be more specific. Accordingly, we now 
assume the equation of interest is a member of the simultaneous system 


$$Y B + N \Gamma = U \qquad (6.7)$$


in which $Y$ is the $(T \times J)$ matrix of observations on the $J$ endogenous variables,
$N$ is the $(T \times K)$ matrix of observations on the $K$ exogenous variables and $U$
is the $(T \times J)$ matrix of errors. It is assumed that the $t^{th}$ row of $U$, $U_{t,\cdot}$, is a
vector of independent random variables with zero mean and covariance matrix
$\Sigma_0$ whose typical element is $\sigma_{i,j,0}$, and which is independent of $U_{s,\cdot}$ for all $s \neq t$.
Without loss of generality, we focus attention on the IV estimator of the
parameters in the first equation of the system. To this end, we partition $Y$ and
$N$ as follows: $Y = [y_1, Y_1]$ and $N = [N_1, N_2]$, where $y_1$ is the $(T \times 1)$ vector of
observations on the first endogenous variable in the system, and $N_i$ is $(T \times K_i)$
with $t^{th}$ row $N_{i,t}'$, and $K_1 + K_2 = K$. The first equation of the system can then
be written as

$$y_1 = Y_1\beta_0 + N_1\gamma_0 + u_1 \qquad (6.8)$$


where $u_1 = U_{\cdot,1}$ is the first column of $U$. The reduced form of the system in
(6.7) is given by

$$Y = N\Pi + A$$

where $\Pi = -\Gamma B^{-1}$ and $A = U B^{-1}$. It is convenient to write this reduced form
as

$$[y_1, Y_1] = [N_1, N_2]\begin{bmatrix} \Pi_{1,1} & \Pi_{1,2} \\ \Pi_{2,1} & \Pi_{2,2} \end{bmatrix} + [a_1, A_1] \qquad (6.9)$$

Below it is necessary to refer to the reduced form error variance, and so we let
$\omega_{i,j,0}$ denote the $i$-$j^{th}$ element of $\Omega_0 = Var[a_t]$ where $a_t$ is the $t^{th}$ row of
$[a_1, A_1]$.


7 See Phillips (1983) or Bowden and Turkington (1984, pp.137-44) for a survey of these
results.

If we set $X_1 = [Y_1, N_1]$ and $\theta_0' = [\beta_0', \gamma_0']$ then (6.8) can be written as

$$y_1 = X_1\theta_0 + u_1 \qquad (6.10)$$

Equation (6.10) can be recognized to be of the same generic form as the model
in Chapter 2 with $y = y_1$, $X = X_1$ and $p = J + K_1 - 1$. As in that earlier
setting, the observations on the instruments are contained in the $T \times q$ matrix
$Z$ where

$$Z = [N_1, N_2C_2] = NC_1$$

where $K \geq q \geq p$, and $C_1$, $C_2$ are selection matrices. Two aspects of this
instrument choice should be noted. First, the instruments are taken from the set of
exogenous variables that appear in the system. Secondly, the instrument matrix
always includes the exogenous variables from the equation being estimated.
The results discussed below are based on the assumption that $N$ is fixed in
repeated samples. In this case, it is easily verified that $u_1$ satisfies the “Classical
assumptions” listed in Assumption 2.5, and so the “optimal” two step estimator
is just the two stage least squares estimator.⁸ We therefore focus on this version
of the IV estimator, that is

$$\hat\theta_T = \begin{bmatrix} \hat\beta_T \\ \hat\gamma_T \end{bmatrix} = (X_1'P_ZX_1)^{-1}X_1'P_Zy_1 \qquad (6.11)$$

where $P_Z = Z(Z'Z)^{-1}Z'$.

Phillips (1980) derives the exact distribution of $\hat\beta_T$ in the case where $U_{t,\cdot}$
possesses a normal distribution. The resulting expression is extremely complicated
and, to quote Phillips himself, “not as easy to interpret as we would like”.⁹
Therefore we do not present the precise details here. Instead, we abstract to a
more general level and use Phillips’s (1980) result to examine what aspects of
the specification affect this distribution. To simplify this discussion, we restrict
attention to the case in which $J = 2$ and $C_1 = I_K$; therefore the system consists
of only two equations and all the exogenous variables are used as instruments.
It is worth noting that prior to Phillips’s work, the finite sample distribution
of $\hat\beta_T$ had been derived for certain special cases, and one of these is for $J = 2$.
So, since we limit attention to this case, our discussion can take advantage of
insights gained from the earlier studies by Richardson (1963), Sawa (1969) and
Anderson and Sawa (1973, 1979).¹⁰

Using the aforementioned results, it can be shown that the finite sample
distribution of $\hat\beta_T$ depends on the following aspects of the specification:


• $\beta_0$, the true parameter value.
• $q - p$, the degree of overidentification.


8 See Section 2.4. 
9 See Phillips (1980) [p.870]. 

10 Notice that Richardson (1963) and Phillips (1980) employ normalizations so that the variance
of the reduced form error and the instrument cross product matrix are both identity matrices.
These restrictions facilitate the analysis but also must be borne in mind when considering
how the properties of the instruments affect the distribution.




• $\mu^2$, the concentration parameter,

$$\mu^2 = \omega_{2,2,0}^{-1}\,\Pi_{2,2}'[N_2'N_2 - N_2'N_1(N_1'N_1)^{-1}N_1'N_2]\Pi_{2,2} \qquad (6.12)$$

• $\Sigma_0$, the covariance matrix of the errors.


It is not surprising that the distribution depends on both $\beta_0$ and $\Sigma_0$, but for
our purposes here, there is little to be learnt from exploring the nature of their
impact on the distribution. It is the roles of $q - p$ and $\mu^2$ which provide the
most useful insights. Most of these insights have been revealed by numerical
calculations, but there is one interesting facet which can be deduced directly
from the form of the distribution. Phillips’s (1980) analytical result reveals that
the finite sample moments of $\hat\beta_T$ only exist up to the order $q - p$.¹¹ Anderson
and Sawa (1979) evaluate the distribution of $\hat\beta_T$ numerically for a wide variety
of parameter settings. In general terms, their results suggest the following
conclusions ceteris paribus. As $q - p$ increases the finite sample distribution tends
to be negatively skewed and to exhibit less variation than would be predicted
by the asymptotic distribution. In other words, as $q - p$ increases the
distribution becomes increasingly concentrated about some point away from the true
value. In contrast, increases in $\mu^2$ tend to offset both these effects, although
the distribution still exhibits less variation than would be anticipated from the
asymptotic approximation.
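
The qualitative effect of $q - p$ can be explored with a small Monte Carlo experiment. The sketch below is not Anderson and Sawa's exact calculation: it simulates rather than evaluates the distribution, redraws the instruments in every replication (whereas the results above take $N$ as fixed), and uses a hypothetical design with one endogenous regressor, no included exogenous variables (so $p = 1$) and a single relevant instrument. All parameter values and function names are assumptions made for illustration.

```python
import random
import statistics


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def tsls(y, x, instruments):
    """2SLS slope for one endogenous regressor and no included exogenous
    variables: (x'P_Z y) / (x'P_Z x), with P_Z built by Gram-Schmidt."""
    basis = []
    x_pz_x = x_pz_y = 0.0
    for z in instruments:
        w = list(z)
        for b, bb in basis:  # remove the components along earlier columns
            coef = dot(b, w) / bb
            w = [wi - coef * bi for wi, bi in zip(w, b)]
        ww = dot(w, w)
        basis.append((w, ww))
        wx = dot(w, x)
        x_pz_x += wx * wx / ww
        x_pz_y += wx * dot(w, y) / ww
    return x_pz_y / x_pz_x


def simulate(q, T=50, reps=200, beta0=1.0, pi=0.3, rho=0.8, seed=0):
    """Monte Carlo draws of the 2SLS estimator using q instruments, of which
    only the first is relevant (hypothetical design)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        Z = [[rng.gauss(0, 1) for _ in range(T)] for _ in range(q)]
        u1 = [rng.gauss(0, 1) for _ in range(T)]
        # reduced form error with correlation rho with the structural error
        u2 = [rho * e + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for e in u1]
        x = [pi * z1 + v for z1, v in zip(Z[0], u2)]
        y = [beta0 * xi + e for xi, e in zip(x, u1)]
        draws.append(tsls(y, x, Z))
    return draws


for q in (2, 15):
    draws = simulate(q)
    quartiles = statistics.quantiles(draws, n=4)
    # number of instruments, median bias, interquartile range
    print(q, round(statistics.median(draws) - 1.0, 3),
          round(quartiles[2] - quartiles[0], 3))
```

The median is used as the measure of location because only moments up to order $q - p$ exist. In a design of this kind, the median bias would be expected to be larger with fifteen instruments than with two.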

Since all our statistical theory is based on $T \to \infty$, two questions naturally
arise: at what sample size does asymptotic theory provide a good approximation,
and on what aspects of the specification does this depend? It can
be recalled from Theorem 3.1 that $\hat\beta_T$ is consistent and so as the sample size
increases the distribution of $\hat\beta_T$ must collapse onto $\beta_0$, whereas Theorem 3.2
states that $T^{1/2}(\hat\beta_T - \beta_0)$ converges in distribution to a normal random vector.
In fact, both behaviours only occur if $\mu^2 \to \infty$.¹² Therefore this is the route
through which $T$ affects the distribution, and this relationship can be made
explicit by rewriting (6.12) as

$$\mu^2 = T\,\omega_{2,2,0}^{-1}\,\Pi_{2,2}'(M_{2,2} - M_{2,1}M_{1,1}^{-1}M_{1,2})\Pi_{2,2} = T\bar\mu^2, \text{ say} \qquad (6.13)$$

where $M_{i,j} = T^{-1}N_i'N_j$. Equation (6.13) reveals an interesting feature of the
passage from finite sample to asymptotic behaviour: it is not $T$ per se that
matters, but $T\bar\mu^2$. Therefore, $\bar\mu^2$ affects the sample size at which asymptotic
theory manifests itself. In particular, notice that if $\bar\mu^2$ is very close to zero then
the passage from finite sample to asymptotic behaviour is likely to be slow.
Since all our inference rests on asymptotic theory, it is important to gain a
better understanding of what $\bar\mu^2 \approx 0$ implies about the specification. This is
most readily achieved by considering the extreme case in which $\bar\mu^2 = 0$. Clearly


11 Such a relationship had previously been conjectured by Basmann (1961, 1963). 

12 See Anderson and Sawa (1979) [p.174] or Phillips (1983) [footnote 10, p.470]. It is this
behaviour which gives $\mu^2$ its name: as $\mu^2 \to \infty$ the distribution of $\hat\beta_T$ becomes increasingly
concentrated around $\beta_0$ and collapses onto this point in the limit.

this condition holds if either $\Pi_{2,2} = 0$ or $M_{2,2} - M_{2,1}M_{1,1}^{-1}M_{1,2} = 0$. However, a
more instructive answer can be obtained by relating these two conditions back to
the condition for identification in this model. It can be recalled from Section 2.1
that the condition for identification is rank$\{E[z_tx_t']\} = p$. For our model here,

$$E[z_tx_t'] = E\left[\begin{pmatrix} N_{1,t} \\ N_{2,t} \end{pmatrix}\begin{pmatrix} y_{2,t} & N_{1,t}' \end{pmatrix}\right] = E\begin{bmatrix} N_{1,t}y_{2,t} & N_{1,t}N_{1,t}' \\ N_{2,t}y_{2,t} & N_{2,t}N_{1,t}' \end{bmatrix}$$

where, since $J = 2$, we set $Y_1 = y_2$ and $y_2$ is the $(T \times 1)$ vector with $t^{th}$ element
$y_{2,t}$. Using this substitution in (6.9) and the properties of $a_t$, it follows that

$$E[z_tx_t'] = E\begin{bmatrix} N_{1,t}N_{1,t}'\Pi_{1,2} + N_{1,t}N_{2,t}'\Pi_{2,2} & N_{1,t}N_{1,t}' \\ N_{2,t}N_{1,t}'\Pi_{1,2} + N_{2,t}N_{2,t}'\Pi_{2,2} & N_{2,t}N_{1,t}' \end{bmatrix}$$

Inspection reveals that this matrix has rank less than $p$ if either $\Pi_{2,2} = 0$ or
$N_{2,t}$ is an exact linear function of $N_{1,t}$. Notice that the second condition implies
that $N_2 = N_1H$ and so

$$R_{2,1} = M_{2,2} - M_{2,1}M_{1,1}^{-1}M_{1,2} = T^{-1}[H'N_1'N_1H - H'N_1'N_1(N_1'N_1)^{-1}N_1'N_1H] = 0$$

Therefore, the conditions for $\theta_0$ to be unidentified are exactly the same as those
for $\bar\mu^2 = 0$.¹³ The re-emergence of the condition for identification here is not
surprising, because it is fundamental to our ability to estimate $\theta_0$ from the
population moment condition. However, this analysis also adds a new facet to our
understanding of the relationship between the two. If either $\Pi_{2,2}$ or $R_{2,1}$ is very
close to zero then $E[z_tu_t(\theta)]$ may be very close to zero for $\theta \neq \theta_0$. In this case
$\theta_0$ is said to be “weakly identified” by $E[z_tu_t(\theta_0)] = 0$. Under these conditions,
$\bar\mu^2$ is also likely to be small and so the estimator converges slowly toward the
behaviour predicted by asymptotic theory.¹⁴ Anderson and Sawa (1979) report
evidence that this convergence is further slowed down by increases in $q - p$.
They conclude that “the desirable asymptotic properties of the 2SLS estimator
are not necessarily expected to be relevant to the cases that appear in practice,
that is, the sample size being at least 50 but less than 100 and the number of
excluded exogenous variables” ($q - p$ in our notation) “being more than 10 but
less than 50” (Anderson and Sawa (1979) [p.175]). It is important to remember
their time of writing when interpreting what sample sizes are “relevant” in
practice. Nevertheless, their conclusions give us an indication of circumstances
in which the asymptotic approximation may not be accurate.
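
The link between the concentration parameter in (6.12) and the two sources of identification failure can be made concrete with a short calculation. This is a minimal sketch, assuming $K_1 = K_2 = 1$ so that every quantity in (6.12) is a scalar; the data values and the function name `concentration_parameter` are hypothetical.

```python
def concentration_parameter(n1, n2, pi22, omega22):
    """mu^2 from (6.12) when N1 and N2 each have a single column (K1 = K2 = 1)."""
    n1n1 = sum(a * a for a in n1)
    n2n2 = sum(b * b for b in n2)
    n2n1 = sum(a * b for a, b in zip(n1, n2))
    # N2'N2 - N2'N1 (N1'N1)^{-1} N1'N2: variation in N2 not explained by N1
    partial = n2n2 - n2n1 ** 2 / n1n1
    return pi22 ** 2 * partial / omega22


T = 8
n1 = [1.0] * T                                     # included exogenous variable: a constant
n2 = [0.5, -1.0, 1.5, 0.0, 2.0, -0.5, 1.0, -1.5]   # excluded exogenous variable

print(concentration_parameter(n1, n2, pi22=0.5, omega22=1.0))           # 2.625
print(concentration_parameter(n1, n2, pi22=0.0, omega22=1.0))           # 0.0, Pi_{2,2} = 0
print(concentration_parameter(n1, [3.0 * a for a in n1], 0.5, 1.0))     # 0.0, N2 = N1 H
print(concentration_parameter(n1 * 2, n2 * 2, pi22=0.5, omega22=1.0))   # 5.25, sample repeated twice
```

The last line repeats the sample, which doubles $T$ while leaving $\bar\mu^2$ unchanged, so $\mu^2$ doubles as (6.13) indicates.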

The above discussion provides insights into the nature of the finite sample 
distribution. It is also useful to have similar insights for specific features of 
the distribution such as the mean and variance. Hillier, Kinal, and Srivastava 


13 This assumes $\omega_{2,2,0} < \infty$.
14 Also see Section 8.2.

(1984) derive exact formulae for the moments of the IV estimator under
normality. These formulae can be used to calculate the bias and mean squared
error but are sufficiently complicated to be uninterpretable.¹⁵ However, it is
possible to develop more revealing expressions if we are prepared to settle for
approximations to the finite sample moments. This topic is discussed in the
next sub-section.


6.2.2 Higher Order Approximations 


The asymptotic analysis in Chapters 2 through 5 is often referred to as “first
order” asymptotics. This terminology originates from the idea of expressing the
statistic of interest, $c_T$ say, as a polynomial expansion in negative powers of $T$
such as

$$c_T = c_0 + c_1T^{-1/2} + c_2T^{-1} + c_3T^{-3/2} + \cdots$$

The limiting behaviour of $c_T$ is governed by the lead or first term of the
expansion $c_0$, and this gives rise to the terminology. As mentioned above, these first
order asymptotics only provide an approximation to finite sample behaviour.
Intuition suggests that a better approximation can be obtained by including 
higher order terms from this expansion. In this section, we review the litera- 
ture on two types of higher order expansions for GMM estimators: Edgeworth 
expansions for the distribution function, and Nagar expansions for the bias and 
mean square error. 

Edgeworth expansions provide a bridge between the finite sample and
limiting distributions, and by examining their lead terms it is possible to uncover
what factors affect the passage to the limiting distribution. Sargan and Mikhail
(1971) and Sargan (1975) derive the Edgeworth expansion for the IV estimator
in the static linear simultaneous model in (6.7) with normal errors.¹⁶ For our
purposes here, it is sufficient to focus on the case in which $J = 2$ and so there
are only two endogenous variables. In this case, Sargan and Mikhail (1971)
show that

$$P\left[\frac{T^{1/2}(\hat\beta_T - \beta_0)}{\sqrt{AVar(\hat\beta_T)}} < r\right] = \Phi(r) + \frac{D_1(r)}{T^{1/2}} + \frac{D_2(r)}{T} + O(T^{-3/2}) \qquad (6.14)$$

where $\hat\beta_T$ is the estimator defined in (6.11), $AVar(\hat\beta_T)$ is the asymptotic variance
of $\hat\beta_T$, $\Phi(\cdot)$ is the cumulative distribution function of the standard normal
distribution, and $D_i(\cdot)$, $i = 1,2$ are constants that depend on the model.¹⁷ It can be
recognized that the first term on the right hand side of (6.14) is the probability


15 Knight (1986) derives exact formulae for the moments of the 2SLS estimator when the 
errors follow an Edgeworth type distribution but these expressions possess the same advantages 
and disadvantages as their counterparts when the error has a normal distribution. 

16 Also see Morimune (1983). 

17 Sargan (1975) extends the analysis to the case where $\hat\beta_T - \beta_0$ is standardized by the
square root of the estimated asymptotic variance.

of the particular event based on the asymptotic distribution of the estimator. It
therefore follows that the use of the limiting distribution to calculate such
probabilities involves an error of order $O(T^{-1/2})$. More can be learned about this
error by examining the determinants of $D_i(\cdot)$. Sargan and Mikhail (1971) report
calculations that indicate the asymptotic approximation tends to deteriorate as
the degree of overidentification increases. This leads them to conclude that:


“by using an intelligent choice of instrumental variables, little may 
be lost in the asymptotic variance of the estimator and a good deal 
may be gained in the decreased error of the asymptotic approxima- 
tions.” [Sargan and Mikhail, 1971, p.158] 


In terms of more modern terminology, this conclusion can be stated as saying 
that the inclusion of redundant or nearly redundant instruments tends to lead 
to a deterioration in the quality of the asymptotic approximation. 

Nagar (1959) develops expansions for the first two moments of the Two
Stage Least Squares estimator in the linear simultaneous equations model with
normal errors. Although the approximations have been generalized subsequently
to certain other distributions,¹⁸ it is most convenient to maintain normality and
also to continue to restrict attention to the case in which $J = 2$; all notation
is the same as in the previous sub-section.

Nagar (1959) derives a random vector, $b_z$, and a random matrix, $M_z$, such
that:

$$\hat\theta_T - \theta_0 = b_z + o_p(T^{-1}) \qquad (6.15)$$
$$(\hat\theta_T - \theta_0)(\hat\theta_T - \theta_0)' = M_z + o_p(T^{-2}) \qquad (6.16)$$

He then approximates the bias (first moment) of $\hat\theta_T$ by $E[b_z]$ and the mean
square error matrix of $\hat\theta_T$ by $E[M_z]$.¹⁹ This leads to the following
approximations: for the bias (up to the order of $T^{-1}$)

$$E[b_z] = (q - p - 1)Q_zs \qquad (6.17)$$

where $Q_z = (X'P_ZX)^{-1}$,

$$s = \begin{bmatrix} B_1'\sigma_1 \\ 0_{p-1} \end{bmatrix}$$

$B_1$ is the matrix satisfying $A_1 = UB_1$, $\sigma_1$ is the first column of $\Sigma$ and $0_r$ is the
$r \times 1$ null vector; and for the mean squared error (up to order $T^{-2}$)

$$E[M_z] = \sigma_{1,1}Q_z(I + A^*) \qquad (6.18)$$

where

$$A^* = [-2(q - p - 1)\,tr(Q_zH_1) + tr(Q_zH_2)]\,I_p + \{[(q - p)^2 - 3(q - p) + 4]H_1 - (q - p - 2)H_2\}Q_z,$$


18 See Buse (1992), Donald and Newey (2001) and Peixe and Hall (2000). 

19 While this step has an obvious intuitive appeal, it is not valid in all circumstances; see 
Srinivasan (1970). However, Sargan (1974) establishes a set of conditions under which it is 
valid in the context here.

$H_1 = \sigma_{1,1}^{-1}ss'$ and

$$H_2 = \begin{bmatrix} B_1'\Sigma B_1 & 0_{p-1}' \\ 0_{p-1} & 0_{(p-1)\times(p-1)} \end{bmatrix}$$

and $0_{r\times r}$ is the $r \times r$ null matrix.

These two approximations can be used to explore the impact of the inclusion
of additional instruments upon the first two moments of the estimator.
Inspection of (6.17) reveals that the approximate bias depends on $Z$ via $q - p$ and
$Q_z$. Therefore, the bias is sensitive to both the number of instruments and their
relationship to $y_2$. The bias is also different for each element of $\hat\theta_T$. To gain
a better understanding, it is convenient to focus on $m_z = \|E[b_z]\|$ which can
be interpreted as an aggregate measure of bias in the estimation of $\theta_0$. Buse
(1992) derives a relatively simple condition for $m_z$ to increase when additional
instruments are included in the estimation. To present this condition, define $Z_1$
and $Z_2$ to be respectively $(T \times q_1)$ and $(T \times q_2)$ matrices of instruments and
assume $Z_1$ represents the first $q_1$ columns of $Z_2$. This means $q_1 < q_2$. For what
follows, it is also important to recall that by assumption the first $K_1$ columns
of any instrument matrix contain the explanatory variables, $N_1$, which appear
in the equation being estimated. If $m_i$ equals the value of $m_z$ associated with
$Z = Z_i$ then Buse (1992) shows that

$$m_2 \gtreqless m_1 \quad \text{as} \quad \frac{q_2 - p - 1}{q_1 - p - 1} \gtreqless \frac{R_2^2 - R_N^2}{R_1^2 - R_N^2} \qquad (6.19)$$

where $R_i^2$ represents the uncentred $R^2$ from the regression of $y_2$ on $Z_i$, and $R_N^2$
represents the $R^2$ from the regression of $y_2$ on $N_1$. Therefore the approximate
bias will increase with the number of excess instrumental variables only if the
proportional increase in the number of instruments is faster than the rate of
increase in $R^2$ measured relative to the fit of $y_2$ on $N_1$. This means that the
potential impact of additional instruments depends on the explanatory power
of those already included.
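
The comparison in (6.19) is simple to evaluate once the fit measures are available. The sketch below is illustrative: it assumes $J = 2$, reads (6.19) as saying that $m_z$ is proportional to $(q - p - 1)/(R_z^2 - R_N^2)$ with a factor that does not depend on the instrument set, and uses hypothetical values for the $R^2$ statistics. The function names are not from Buse (1992).

```python
def relative_bias(q, p, r2_z, r2_n):
    """Quantity proportional to m_z = ||E[b_z]||: (q - p - 1) / (R_z^2 - R_N^2).
    The scale factor common to all instrument sets is omitted."""
    if q <= p + 1:
        raise ValueError("requires q > p + 1")
    if not 0 <= r2_n < r2_z <= 1:
        raise ValueError("requires 0 <= R_N^2 < R_z^2 <= 1")
    return (q - p - 1) / (r2_z - r2_n)


def bias_increases(q1, q2, p, r2_1, r2_2, r2_n):
    """Condition (6.19): True if moving from Z1 to Z2 raises the approximate bias."""
    if q1 <= p + 1 or q2 <= q1:
        raise ValueError("requires p + 1 < q1 < q2")
    return (q2 - p - 1) / (q1 - p - 1) > (r2_2 - r2_n) / (r2_1 - r2_n)


# Hypothetical setting: p = 2, four instruments extended to six
p, q1, q2 = 2, 4, 6
r2_n, r2_1 = 0.10, 0.30

print(bias_increases(q1, q2, p, r2_1, 0.35, r2_n))   # weak additions: True
print(bias_increases(q1, q2, p, r2_1, 0.80, r2_n))   # strong additions: False
```

In the first case the number of excess instruments triples while the incremental fit rises by only a quarter, so the approximate bias rises; in the second the improvement in fit is large enough to offset the extra instruments.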

While it is desirable to avoid bias, an increase in bias may be tolerated if
the mean squared error is reduced. Unfortunately, the formula in (6.18) is not
so amenable to interpretation. However, it can be used to numerically evaluate
how the inclusion of new instruments affects the approximate mean squared
error. Peixe and Hall (2000) report this type of calculation for the special case
of the model described above in which $J = 2$, $K_1 = 1$, $K_2 = 8$. More specifically,
they consider the system

$$y_1 = \beta_0y_2 + \gamma_0n_1 + u_1 \qquad (6.20)$$
$$y_2 = N\gamma_2 + u_2 \qquad (6.21)$$

in which $\beta_0 = 1$, $\gamma_{2,1} = \dots = \gamma_{2,5} = .03$, $\gamma_{2,6} = \dots = \gamma_{2,9} = .33$. These
choices imply the first five columns of $N$ have only a marginal contribution to
the explanation of $y_2$, but the last four variables have a more significant impact
so that the population $R^2$ for (6.21) is around 30%. To reflect this dichotomy,
we refer to $\{n_i, i = 1,\dots,5\}$ as “bad” instruments (for $y_2$), and $\{n_i, i = 6,\dots,9\}$
as “good” instruments (for $y_2$). Strictly none of these instruments are
redundant but there is clearly a sense in which the “bad” instruments can be viewed
as “nearly” redundant given the “good” ones.²⁰ The error specification is as
follows: letting $u_{i,t}$ denote the $t^{th}$ element of $u_i$, then $u_t = (u_{1,t}, u_{2,t})'$ is
independently and identically distributed as a normal random vector with mean
zero and a variance-covariance matrix whose diagonal elements are one and
off-diagonal elements are 0.8. The sample size is set at $T = 30$. Table 6.1
reproduces the calculated values of the approximate bias and mean squared error
for various instrument combinations reported in Peixe and Hall (2000). There
are five cases each involving four instruments.²¹ The only difference between
the five cases lies in the number of good and bad instruments included. As
would be expected, the approximate bias and MSE decrease every time a bad
instrument is replaced by a good one. Table 6.1 also reports the percentage
change in bias and MSE if an additional instrument is included. The results
reveal that if the additional instrument is “bad” then both the bias and mean
squared error increase. However, if the additional instrument is “good” then the
impact on the bias is more subtle. If only one of the four instruments is good
then the inclusion of another good one reduces the bias, whereas if at least two
of the four are good then the inclusion of another good one increases the bias.
Also the size of this increase is an increasing function of the number of good
instruments. In spite of this, the inclusion of an additional good instrument
always decreases the mean squared error. While caution must be exercised in
generalizing the specific results to more general settings, one conclusion is clear.
There is a far more complex relationship between the behaviour of the estimator
and the properties of the instrument vector in finite samples than is predicted
by asymptotic theory.
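
The statement that the population $R^2$ for (6.21) is around 30% can be reproduced with a short calculation. The sketch below rests on two assumptions that are not stated in the text: the nine exogenous variables are mutually uncorrelated with unit variance, and $Var(u_2) = 1$ as in the error specification.

```python
# Coefficients on the nine exogenous variables in (6.21)
gamma2 = [0.03] * 5 + [0.33] * 4

# Assumptions made for illustration only: uncorrelated, unit-variance
# exogenous variables and a unit-variance error in (6.21).
explained = sum(g ** 2 for g in gamma2)
r_squared = explained / (explained + 1.0)

# Share of the explained variation attributable to the five "bad" instruments
share_bad = sum(g ** 2 for g in gamma2[:5]) / explained

print(round(r_squared, 3))   # 0.306
print(round(share_bad, 3))   # 0.01
```

Under these assumptions the five “bad” instruments account for about one percent of the explained variation in $y_2$, which is the sense in which they are nearly redundant.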


Table 6.1 
Impact of an additional instrument 


                  Bias                         MSE
Inst.    Value    %+1G    %+1B       Value    %+1G    %+1B

3B1G     0.478  -24.08   48.80       0.546  -27.36    3.66
2B2G     0.243    0.25   49.36       0.390  -17.09    1.90
1B3G     0.163   12.59   49.57       0.319  -12.45    1.25
4G       0.122           49.75       0.277            0.98


Source: Peixe and Hall (2000). 

Notes: Inst denotes the composition of the benchmark set of instruments: e.g. 3B1G denotes
three bad instruments and one good one. Bias and MSE are calculated using (6.17) and (6.18)
respectively. % + 1G (% + 1B) denotes the percentage change in either the bias or MSE as a
result of the inclusion of an additional good (bad) instrument.


20 For ease of expression here, we attribute the property of redundancy directly to the 
instrument and not the associated population moment condition. 

21 Notice that this is the smallest number of instruments for which the second moment of
the estimator exists within this model. See the comments made earlier in this section.

Newey and Smith (2004) develop Nagar type expansions for the bias of both
the two step GMM and continuous updating GMM estimators in nonlinear
static models. To present these results, we return to our generic notation and
so assume that estimation of $\theta_0$ is based on $E[f(v_t,\theta_0)] = 0$. The data vector,
$v_t$, is assumed to be a realization from some independently and identically
distributed process. Let $\hat\theta_T$ denote the two step GMM estimator and $\tilde\theta_T$ denote the
continuous updating GMM estimator.²² Also define $G_t = \partial f(v_t,\theta)/\partial\theta'|_{\theta=\theta_0}$,
$f_t(\theta) = f(v_t,\theta)$ and let $f_{t,i}(\theta)$ denote the $i^{th}$ element of $f_t(\theta)$. Newey and
Smith (2004) show that the approximate bias of the GMM estimator is given by

$$E[\hat\theta_T] - \theta_0 = T^{-1}\{B_I + B_G + B_S + B_W\} + o(T^{-1}) \qquad (6.22)$$

where

$$\begin{aligned}
B_I &= M(S^{-1})\{E[G_tM(S^{-1})f_t(\theta_0)] - a\} \\
B_G &= -(G_0'S^{-1}G_0)^{-1}E[G_t'S^{-1/2}\{I_q - P(\theta_0)\}S^{-1/2}f_t(\theta_0)] \\
B_S &= M(S^{-1})E[f_t(\theta_0)f_t(\theta_0)'S^{-1/2}\{I_q - P(\theta_0)\}S^{-1/2}f_t(\theta_0)] \\
B_W &= -M(S^{-1})\sum_{j=1}^{p}E\left[\frac{\partial\{f_t(\theta_0)f_t(\theta_0)'\}}{\partial\theta_j}\right][M(W) - M(S^{-1})]'e_j,
\end{aligned}$$

$a$ is the $(q \times 1)$ vector with $i^{th}$ element

$$a_i = 0.5\,tr\left\{(G_0'S^{-1}G_0)^{-1}E\left[\frac{\partial^2f_{t,i}(\theta_0)}{\partial\theta\,\partial\theta'}\right]\right\},$$

$M(W) = (G_0'WG_0)^{-1}G_0'W$, $P(\theta_0) = F(\theta_0)[F(\theta_0)'F(\theta_0)]^{-1}F(\theta_0)'$, $F(\theta_0) =
S^{-1/2}G_0$ and $e_j$ is a $(p \times 1)$ vector whose $j^{th}$ element is one and remaining
elements are all zero. As Newey and Smith (2004) observe, these four
components of the bias have an interesting interpretation. To motivate this part of
the discussion, it is useful to first recall the Method of Moments interpretation
of GMM derived from the first order conditions, that is the two step GMM
estimator is the MM estimator based on $G_0'S^{-1}E[f(v_t,\theta_0)] = 0$.²³ If both $G_0$
and $S$ are known, then the GMM estimator is just the value of $\theta$ that sets this
linear combination of the sample moments equal to zero, that is $\bar\theta_T$, the solution
to $G_0'S^{-1}g_T(\bar\theta_T) = 0$. It is easily recognized that this version of the estimator
converges to the same limiting distribution as the two step estimator. We thus
refer to $\bar\theta_T$ as an infeasible optimal GMM estimator: infeasible as $G_0$ and $S$ are
unknown, optimal in the sense that it is a minimum variance estimator based
on $E[f(v_t,\theta_0)] = 0$. With this in mind, we now consider the components of the
bias in turn. $B_I$ is the approximate asymptotic bias of the infeasible optimal
GMM estimator; $B_G$ is a bias term that arises due to the estimation of $G_0$; $B_S$ is
a bias term that arises due to the need to estimate $S$; $B_W$ is a bias term arising
from the first step estimator. Two other general features of this decomposition


22 See (3.102) in Section 3.7. 
23 See Section 3.3.

are worth noting. First, if the parameter vector is just identified then $B_G$, $B_S$
and $B_W$ are all equal to zero. Therefore, overidentification introduces bias from
a variety of sources. Secondly, it is interesting to note that these sources of the
bias depend on the same features of the model that play a crucial role in the
limiting distribution of the GMM estimator in misspecified models.²⁴

Newey and Smith (2004) show that the corresponding bias of the continuous 
updating GMM estimator is given by 


E[θ̂_T] - θ_0 = T^{-1}{B_I + B_S} + o(T^{-1})     (6.23)


In comparison to (6.22), it can be seen that there are fewer sources of bias. 
Specifically, there are no longer bias terms associated with the first step estima- 
tion or the estimation of the derivative matrix. The absence of the first of these 
is to be expected because there is no longer a first step estimation. The sec- 
ond is less easy to explain from a GMM perspective. Newey and Smith (2004) 
show that the absence of B_G is to be expected because the continuous updating
GMM estimator is a member of the class of Generalized Empirical Likelihood
estimators. However, further elaboration here would constitute a major detour,
and so the interested reader is referred to Newey and Smith (2004).²⁵

While these general formulae provide some useful insights into the bias, the 
specific form of the terms is difficult to interpret. Newey and Smith (2004) spe- 
cialize these formulae to three cases of interest: the IV estimator in the linear 
model described in Chapter 2; the Generalized IV estimators described in
Section 7.2; and separable moment conditions, that is f(v_t, θ) = f_1(v_t) - f_2(θ).
In all cases, Newey and Smith (2004) show that the bias of the GMM estimator
increases with the number of overidentifying restrictions, ceteris paribus.
However, some caution is needed in making such comparisons, as noted by Buse
(1992), because the introduction of additional moment conditions alters other
aspects of the model.²⁶ Imbens (2002) reports a similar calculation for a very simple
example in which only one moment condition provides information and all the 
remaining moment conditions are redundant. He shows that the approximate 
bias increases linearly with the number of redundant moment conditions. 


6.3 Simulation Evidence from Nonlinear 
Dynamic Models 


As we have just seen, finite sample distribution theory provides some useful in- 
sights into what aspects of the distribution affect the quality of the asymptotic 
approximation in static models. Intuition suggests that these aspects of the 
specification are going to play a similarly important role in nonlinear dynamic 
models. At the same time, it would also be anticipated that the presence of 
nonlinearity and/or dynamics introduces additional complications. In recent 


24 See Chapter 4. 
25 There is a brief introduction to empirical likelihood estimators in Section 10.2. 
26 See discussion earlier in this section. 
years, concern has grown about the adequacy of the asymptotic approximation
in the sample sizes encountered in practice, and this has spawned a number
of computer based simulation studies calibrated to the types of model which
appear in Table 1.1.²⁷ An overview of these studies is provided by Table 6.2.
In this section we review the main findings from this literature. 


Table 6.2
Simulation studies of the finite sample properties of GMM

Economic or statistical topic   Type     Studies
Asset pricing                   NV, NP   Tauchen (1986), Kocherlakota (1990),
                                         Hansen, Heaton, and Yaron (1996),
                                         Smith (1999)
                                LV, NP   Ferson and Foerster (1994)
Business cycles                 NV, NP   Burnside and Eichenbaum (1996),
                                         Christiano and den Haan (1996)
Covariance structures           NV, LP   Altonji and Segal (1996)
                                NV, NP   Clark (1996)
Inventories                     LV, LP   Fuhrer, Moore, and Schuh (1995),
                                         West and Wilcox (1994, 1996)
Stochastic volatility           NV, NP   Andersen and Sørensen (1996)

Note: Type indicates the functional form of the model, with NV (LV) denoting
nonlinear (linear) in variables and NP (LP) denoting nonlinear (linear) in the
parameters.


As in the previous section, there are two main questions of interest here 
— does asymptotic theory provide a good approximation in the sample sizes 
encountered in practice? — and, what aspects of the specification affect the 
quality of this approximation? The answer to the first question is going to be 
model specific, but the answer to the second is likely to be generic on some level 
and so is our main focus here. In spite of this, it is pedagogically more convenient 
to organize the discussion around four specific studies. We begin with the 
studies by Tauchen (1986) and Kocherlakota (1990) which are calibrated to the 
consumption based asset pricing model used in our empirical example. We then 
briefly summarize the results reported in Hansen, Heaton, and Yaron (1996) 
for a slightly more sophisticated version of this model. Finally, we consider the 
study by Andersen and Sørensen (1996) based on the stochastic volatility model 
described in Section 1.3.5. Together these four studies provide a good overview 
of the qualitative findings from this literature. 

Asymptotic theory has been used to justify the GMM estimation and also 
to develop a vast array of inference procedures based on the estimator. In our 
discussion here, we focus on how well this theory approximates finite sample 


27 As an illustration of the level of this interest, the July 1996 issue of the
Journal of Business and Economic Statistics has a special section devoted to
seven papers on this topic.
behaviour of the two most important components of this framework: the
estimator θ̂_T and the overidentifying restrictions test J_T. Specifically, we
consider the following five questions:


1. Is the GMM estimator approximately unbiased? 
2. How reliable are confidence intervals based on asymptotic theory? 


3. Is the finite sample distribution of the overidentifying restrictions test well
approximated by a χ²_{q-p}?


4. How does iteration affect the answers to 1.—3.? 


5. How does the use of the continuous updating estimator affect the answers 
to 1.—3.? 


Apropos the fourth question, it can be recalled from Section 3.6 that iteration 
beyond the second step has no effect on the asymptotic distribution and was 
proposed purely because of potential gains in finite samples. At that stage in our 
discussion we could only anticipate some advantage; now we can learn whether
these gains are realized in practice. With these five questions in mind, we now 
turn to the simulation evidence. 

Tauchen (1986) examines the behaviour of GMM in Hansen and Singleton’s
(1982) version of the consumption based asset pricing model.²⁸ His design
assumes there is only one asset, and estimation is based on the population
moment condition,


E[z_t u_t(θ_0)] = 0     (6.24)

where

u_t(θ) = δ (c_{t+1}/c_t)^{γ-1} (r_{t+1}/p_t)
z_t = (1, c_t/c_{t-1}, ..., c_{t-L+1}/c_{t-L}, r_t/p_{t-1}, ..., r_{t-L+1}/p_{t-L})'


The degree of overidentification is controlled by L, and Tauchen considers the
cases L = 1,2,3,4. Notice that L = 2 gives the instrument vector used in
our estimation of the model earlier in the text. Two sample sizes are
considered: T = 50,75. A large part of Tauchen’s (1986) contribution is to have
developed a method for generating artificial data consistent with the
underlying model. However, we only comment very briefly on this aspect of his
study. To this end, note that the asset return, r_{t+1}, is given by
r_{t+1} = p_{t+1} + d_{t+1}, where d_{t+1} denotes the dividends paid out during
the period. Therefore, the model can be viewed as depending on three stochastic
variables: c_t, d_t and p_t. Tauchen generates data on the first two of these
variables from a VAR(1) model for [ln(c_{t+1}/c_t), ln(d_{t+1}/d_t)]. Given these
data and the Euler equation for t = 1,2,...,T, it is possible to solve for {p_t}.
Tauchen reports results for various choices of parameters in the VAR; he sets
γ = 0.3, 1.30 and δ = 0.97. The second step weighting matrix is Ŝ_{SU}^{-1},
where Ŝ_{SU} is defined in (3.40). It should be noted that


28 See Section 1.3.1. 
Tauchen only considers the two step estimator because that was the conventional
practice at his time of writing. So his results cannot help us with questions 4 or
5 above. However, in terms of the other three questions, his study reveals the
following answers in order.


1. Bias: For L = 1 (i.e. q - p = 1) the estimator is approximately unbiased,
but there is a tendency for the bias to increase as L, and hence q - p,
increases. At the same time, increases in L reduce the variance and so θ̂_T
becomes concentrated at a value away from the truth. Interestingly, this
mirrors Anderson and Sawa’s (1979) finding for the IV estimator in the
linear model discussed in the previous section.


2. C.I.’s: For L = 1,2 (i.e. q - p = 1,3) the empirical coverage of the
asymptotic confidence intervals is approximately equal to the nominal value.²⁹
However, for L = 3,4 (i.e. q - p = 5,7) the empirical coverage tends to
be less than the nominal value.


3. J_T: The empirical size of the overidentifying restrictions test tends to be
close to its nominal value in all cases considered.³⁰ If anything, the test
rejects slightly less frequently than would be anticipated from asymptotic
theory.
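
The notions of empirical coverage and empirical size used in this list can be
made concrete with a small Monte Carlo sketch. The design below is not
Tauchen's: it is a hypothetical static linear IV model with one regressor and
three instruments (so q - p = 2), serially uncorrelated moments, and
illustrative parameter values, all of which are assumptions made purely for
this example. The function names are likewise our own.

```python
import numpy as np

def two_step_gmm_iv(y, x, z):
    """Two-step GMM for y_t = x_t*theta + u_t with instruments z (T x q)."""
    T = len(y)
    zx = z.T @ x / T                      # sample analogue of E[z_t x_t]
    zy = z.T @ y / T                      # sample analogue of E[z_t y_t]
    # First step: weighting matrix (Z'Z/T)^{-1}, which gives 2SLS.
    W1 = np.linalg.inv(z.T @ z / T)
    theta1 = (zx @ W1 @ zy) / (zx @ W1 @ zx)
    # Second step: S estimated from first-step residuals, assuming the
    # moments are serially uncorrelated (an assumption of this sketch).
    zu = z * (y - x * theta1)[:, None]
    W2 = np.linalg.inv(zu.T @ zu / T)
    theta2 = (zx @ W2 @ zy) / (zx @ W2 @ zx)
    se = np.sqrt(1.0 / (zx @ W2 @ zx) / T)
    g = z.T @ (y - x * theta2) / T        # sample moment at the estimate
    J = T * g @ W2 @ g                    # overidentifying restrictions statistic
    return theta2, se, J

def monte_carlo(T=200, reps=2000, theta0=1.0, seed=0):
    """Return (empirical coverage of the 95% C.I., empirical size of the 5% J test)."""
    rng = np.random.default_rng(seed)
    q = 3
    pi = np.full(q, 0.5)                  # first-stage coefficients (illustrative)
    crit = 5.991                          # 95% quantile of chi-square with q - p = 2 d.f.
    covered = rejected = 0
    for _ in range(reps):
        z = rng.standard_normal((T, q))
        v = rng.standard_normal(T)
        u = 0.5 * v + rng.standard_normal(T)   # u correlated with v: x is endogenous
        x = z @ pi + v
        y = theta0 * x + u
        theta_hat, se, J = two_step_gmm_iv(y, x, z)
        covered += abs(theta_hat - theta0) <= 1.96 * se
        rejected += J > crit
    return covered / reps, rejected / reps

coverage, size = monte_carlo()
print(f"empirical coverage: {coverage:.3f}, empirical size: {size:.3f}")
```

Varying T, the number of instruments, or the first-stage coefficients in such a
design is one simple way to explore how the quality of the asymptotic
approximation responds to the features discussed in this section.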


Based on this evidence, Tauchen recommends that q — p be kept less than or 
equal to three in this model with these sample sizes. However, there is one 
aspect of Tauchen’s (1986) study which should be borne in mind when consid- 
ering this recommendation. The degree of overidentification is controlled by L, 
and so an expansion of the instrument vector involves the inclusion of lagged 
values of consumption growth and the asset return from further back in time. 
Now u_t(θ) depends on (c_{t+1}/c_t, r_{t+1}/p_t), and within his design the
autocorrelations of these variables decay as the lag length increases. Therefore,
as L increases z_t becomes augmented with variables whose association with u_t(θ)
is decaying. In other words, every increase in L introduces instruments whose
quality is worse than those already included. This is not a criticism of Tauchen’s 
(1986) design because this strategy is commonly used for instrument selection 
in Euler equation models in practice. However, it is probably more appropri- 
ate to view Tauchen’s recommendation within the context of this instrument 
selection strategy than as a more general comment about the desirable degree 
of overidentification per se. 

Kocherlakota (1990) uses Tauchen’s (1986) simulation method to investigate 
the behaviour of GMM in Hansen and Singleton’s (1982) model with multiple 
assets. In this case, estimation is based on the population moment condition 


E[z_t ⊗ u_t(θ_0)] = 0     (6.25)


29 “Empirical coverage” is the term used for the proportion of the replications in which
the calculated confidence interval contains the true parameter value. So for a 95% confidence
interval, say, to be perfectly accurate, its empirical coverage must equal its nominal value,
which is 95%.

30 “Empirical size” is the term given to the proportion of replications in which the test is
significant.
where u_t(θ) is an (s × 1) vector whose i-th element is
δ(c_{t+1}/c_t)^{γ-1}(r_{i,t+1}/p_{i,t}) and z_t is a (k × 1) vector of
instruments. Kocherlakota considers models with up to
three assets, i.e. s = 1,2,3, and k = 1,3,4. However, of the seven particular
combinations chosen, six involve q - p = 1 and one involves q - p = 6. Since
the design involves multiple assets, Kocherlakota is able to confine attention to
instrument vectors whose elements come from the set
{1, c_t/c_{t-1}, r_{i,t}/p_{i,t-1}}; in other words, L = 1 in terms of the
notation used to describe Tauchen’s (1986) study. Unlike Tauchen (1986),
Kocherlakota (1990) evaluates the performance of both the two step and iterated
estimator, the latter with I_max = 70 and with Ŝ_T(i) = Ŝ_{SU} for i > 1. Two
other differences between the two studies are also worth noting: Kocherlakota
(1990) sets T = 90 for the most part but also reports results for
T = 200, 500, 2000; he also sets γ = 13.7 and δ = 1.139.³¹ We begin
our discussion of his results with the case in which T = 90 because these most
closely parallel Tauchen’s settings. As a whole, Kocherlakota’s (1990) evidence
suggests that iteration beyond the second step considerably improves the
quality of the asymptotic theory as an approximation to finite sample behaviour.
So strong is the evidence that he focuses entirely on the iterated estimator in
the published version of his paper, and so our discussion of his results must do
the same. In terms of the other three questions, his findings are as follows.


1. Bias: There is evidence of bias in some cases, and not others. This bias 
does not appear to be linked to the degree of overidentification per se, 
that is to the limited extent this can be assessed within this design. 


2. C.I.’s: The empirical coverage of the asymptotic confidence intervals is
too low in nearly every case and in some cases strikingly so, e.g.
≈ 60% instead of the nominal value of 95%.


3. J_T: The quality of the asymptotic approximation is good in some cases
but not in others. In the latter, the empirical size of the test tends to be 
around 20% when the nominal size is 5%. 


Buried within this summary is an interesting pattern to the results. Although 
the choices s = 1,k = 3 and s = 3,k = 1 both imply q — p = 1 the estimator 
behaves very differently in the two cases. If there are multiple assets and one 
instrument (s = 3, k = 1) then the finite sample behaviour is well approximated 
by the asymptotic theory, but if there is one asset and multiple instruments 
(s = 1,k = 3) then the estimator is biased, the asymptotic confidence inter- 
vals are unreliable and the overidentifying restriction test rejects too frequently. 
Therefore, low values of q — p are no guarantee that the asymptotic approxima- 
tion is good. 

One attractive feature of Kocherlakota’s (1990) study is that he also consid- 
ers what happens as T increases. As T moves from 90 to 200, 500 and finally 


31 The parameter values are calibrated to replicate certain features of annual data for the
U.S. spanning 1889-1978. In contrast, Tauchen’s (1986) parameter settings were chosen to be
“reasonable” from an economic theoretic standpoint. It should be noted that Kocherlakota
(1990) also reports a limited number of simulation results using data generated with other
parameter values including γ = 0.3, δ = 0.97 which were used by Tauchen.
2000, the quality of the asymptotic approximation improves. However, in the
worst cases, it is only at the largest sample size that asymptotic theory ac- 
curately predicts the empirical coverage of the asymptotic confidence intervals 
and the empirical size of the overidentifying restrictions test. While this is not 
an encouraging conclusion, Kocherlakota finds that the situation is worse with 
the two step estimator. He finds that after only two steps, the overidentifying 
restrictions test converges very slowly to its asymptotic distribution. 

Clearly, some aspect of the asymptotic theory is not providing a good
approximation to finite sample behaviour. It would be useful to diagnose
where the problem lies, and Kocherlakota (1990) provides some useful guidance
in this direction for the overidentifying restrictions test. To describe what he
did, we must remind ourselves of the structure of the estimated sample moment
again. Equation (3.35) shows that


W_T^{1/2} T^{1/2} g_T(θ̂_T) = N_T(θ̂_T) W_T^{1/2} T^{1/2} g_T(θ_0) = Ñ_T T^{1/2} g_T(θ_0), say     (6.26)


It can be recalled from Section 3.4.3 that the asymptotic normality of
W_T^{1/2} T^{1/2} g_T(θ̂_T) rested on the convergence in probability of Ñ_T to a
matrix of constants and the application of the Central Limit Theorem to
T^{1/2} g_T(θ_0). Interestingly, Kocherlakota (1990) finds that T = 90 is large
enough for T^{1/2} g_T(θ_0) to be approximately normally distributed in all the
cases he considers. The problem stems from Ñ_T. Kocherlakota (1990) finds that
the cases in which the χ² approximation is poor are exactly the cases in which
Ñ_T is still exhibiting considerable variability. Since Ñ_T is the product of
matrices, Kocherlakota’s (1990) evidence points to two possible culprits:
Ŝ_T^{-1} and G_T(θ̂_T). Interestingly, this evidence highlights two of the
sources of bias in the Nagar type expansion for the GMM estimator described in
the previous sub-section; see equation (6.22). The involvement of G_T(θ̂_T) here
also creates an interesting tie in with our discussion of the finite sample
distribution of the IV estimator in the static linear model. It can be recalled
from the previous section that the convergence of the IV estimator to its
asymptotic distribution depends on the concentration parameter, and that this
convergence is likely to be slow if θ_0 is “weakly” identified. Now the matrix
G_T(θ̂_T) has a similar link to identification because it is the sample analog
of G_0.³² This suggests that weak identification may be one source of the
problems noted in Kocherlakota’s (1990) study, an explanation which would
certainly accord with our empirical experience of the model in Chapter 3.³³

Before we move on to discuss the other two studies mentioned above, it is 
worth reflecting what we have learnt from Tauchen’s (1986) and Kocherlakota’s 
(1990) results about the interpretation of our empirical results. It can be recalled
that our choice of z_t is a special case of Tauchen’s (1986) design with L = 2, and
one in which he found asymptotic theory provided a reasonable approximation
even in his much smaller sample sizes. However, Kocherlakota’s (1990) study
reveals that the quality of the approximation can be sensitive to θ_0 as well as


32 Recall that the condition for local identification is rank(G_0) = p; see Assumption 3.6 in
Section 3.1.
33 In particular see the discussion in Section 3.6. 
other aspects of the data generation process. In particular, it seems reasonable
to be concerned about the quality of the identification and how it has affected 
finite sample behaviour. So it may be premature to draw a line under the results 
obtained so far, and we return to this example in the next two chapters as we 
explore various methods for improving inference based on GMM estimation. 

Hansen, Heaton, and Yaron (1996) also examine the behaviour of GMM and 
its associated statistics in a consumption based asset pricing model. However, 
their study builds from those described above in two important ways. First, 
they allow for time non-separability in the utility function of the representative 
agent. Second, they simulate the behaviour of the continuous updating estimator
as well as the two step and iterated estimators. 

Hansen, Heaton, and Yaron (1996) consider the case in which the
representative agent’s utility function takes the form,


U(c_t) = [(c_t + η_0 c_{t-1})^{1-γ_0} - 1] / (1 - γ_0)

Notice if η_0 = 0 then this utility function reduces to the CRRA utility function
used by Tauchen and Kocherlakota.³⁴ The agent is assumed to invest in two
assets: a bond, whose payoff is denoted R_{1,t}, and a stock, whose payoff is
denoted R_{2,t}.³⁵ Hansen, Heaton, and Yaron (1996) simulate artificial data from
this model for a number of scenarios of empirical relevance.³⁶ For brevity, we
focus on two scenarios here: in the first, the data generation process is calibrated
to annual US data and the sample size is set to T = 100; and in the second, the
data generation process is calibrated to monthly US data and the sample
size is set to T = 400. In both cases, they consider the case in which estimation
is based on the population moment condition


E[z_t ⊗ e_{t+2}(θ_0)] = 0     (6.27)


where z_t ∈ Ω_t, θ = (γ, δ, η)', δ is the discount factor³⁷, and e_{t+2}(θ) is
the (2 × 1) vector with i-th element given by³⁸


e_{i,t+2}(θ) = 1 + δη [(c_{t+1} + η c_t)/(c_t + η c_{t-1})]^{-γ}
               - δ R_{i,t+1} [(c_{t+1} + η c_t)/(c_t + η c_{t-1})]^{-γ}
               - δ²η R_{i,t+1} [(c_{t+2} + η c_{t+1})/(c_t + η c_{t-1})]^{-γ}


Two choices of instrument are used: z_{1,t} = (1, c_t/c_{t-1})' and
z_{2,t} = (z_{1,t}', R_{1,t}, R_{2,t})', for which q - p equals 1 and 5 respectively.


34 If η_0 = 0 then utility is time separable in the sense that utility in period t depends
only on consumption in period t; otherwise, utility is said to be time non-separable because
utility in period t depends on both contemporaneous and lagged consumption.

35 In terms of the notation above, R_{2,t} = (p_t + d_t)/p_{t-1}.

36 Hansen, Heaton, and Yaron (1996) use a variation on Tauchen’s (1986) method to
simulate the data.

37 See Section 1.3.1.

38 It should be noted that e_{t+2}(θ) is a transformed version of the Euler equation associated
with this model.
Hansen, Heaton, and Yaron’s (1996) evidence for the two step and iterated
estimators tends to corroborate the findings from the previous studies.
Therefore, we do not report specific details save to note that they find
asymptotic theory tends not to be a good guide in samples of size T = 100,
whereas it is reasonably accurate for the iterated estimator with T = 400 for
the model with q - p = 1, but not in the model with q - p = 5. Instead, we focus
our discussion on how their results illuminate the relative properties of the
iterated and continuous updating estimators. The most striking feature of this
comparison is that the continuous updating estimator converges to very extreme
values in a small but significant number of the replications whereas the
iterated estimator does not.³⁹ This behaviour is the source of two key
differences between the
simulated distributions of the estimators. First, the simulated distribution of 
the continuous updating estimator exhibits far longer tails than those of the 
iterated estimator. Secondly, these extreme values are not evenly distributed 
between the left and right tails and so cause an asymmetry in the simulated 
distribution of the continuous updating estimator which does not appear to be 
present for the iterated estimator. Both these features manifest themselves in 
the moments of the simulated distribution, and so impact on the comparison of 
the estimators. For example, if bias is measured as the difference between the 
true value and the median of the simulated distribution then in most cases — but 
not all — the continuous updating estimator exhibits less bias than the iterated 
estimator. However, if the median is replaced by the mean in the previous
calculation, then in most cases the ranking is reversed. This tail behaviour leads
Hansen, Heaton, and Yaron (1996) to conclude that “from the standpoint of obtaining estimates,
we see no particular advantage to using continuous updating when minimizing 
GMM criterion functions” [p.278]. However, they also note that the use of the 
continuous updating estimator may be advantageous for inference. Specifically, 
they find that the overidentifying restrictions test based on the continuous up- 
dating estimator tends to exhibit empirical size closer to its nominal value than 
its counterpart based on the iterated estimator. It is worth noting that this 
conclusion regarding the relative merits of the iterated GMM and the continu- 
ous updating estimator appears to be in conflict with that based on their Nagar 
type expansions; see (6.22)—(6.23) in the previous sub-section. These differing 
conclusions may reflect the different contexts: the Nagar expansions are for 
static models and the simulation results are for a dynamic model. Further work 
is needed to reconcile the results from these two approaches. 


The simulation studies described above shed little light on what factors affect
the behaviour of Ŝ_T^{-1}. To gain some insight into this question, it is useful to
recall both the form of the long run variance and the estimators. It can be
recalled from Section 3.5 that


S = Γ_0 + Σ_{i=1}^{∞} (Γ_i + Γ_i')


39 Also see Section 3.7. 
and our basic strategy for estimating this matrix is to use a weighted sum
of the sample autocovariance matrices.⁴⁰ So there are two natural questions:
what factors affect the convergence of the sample autocovariances to their
population counterparts, and what factors affect the convergence of our
weighted sum of autocovariances to S? The answer to the first depends on
the nature of the nonlinearity in f(v_t, θ). The answer to the second depends in
part on the weighted sum involved. We have reviewed the extensive literature 
on covariance matrix estimation already, and below we discuss some further 
simulation evidence on this issue. However, before that, it is useful to expand 
a little on the answer to the first question. 
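
To fix ideas about the "weighted sum of the sample autocovariance matrices",
the following is a minimal sketch of a Bartlett-kernel estimator of S. It
assumes the moment functions evaluated at an estimate are stacked in a T × q
array, that the weight on the i-th autocovariance is 1 - i/b for i < b and zero
otherwise, and that no centring or prewhitening is applied. The function name
and these conventions are our own choices for illustration, not a transcription
of any particular formula in the text.

```python
import numpy as np

def bartlett_hac(f, bandwidth):
    """Bartlett-kernel estimate of the long run variance of the rows of f (T x q)."""
    f = np.asarray(f, dtype=float)
    T = f.shape[0]
    S = f.T @ f / T                          # Gamma_hat_0
    for i in range(1, min(int(bandwidth), T)):
        gamma_i = f[i:].T @ f[:-i] / T       # Gamma_hat_i = T^{-1} sum_t f_t f_{t-i}'
        weight = 1.0 - i / bandwidth         # Bartlett weight, declining linearly in i
        S += weight * (gamma_i + gamma_i.T)
    return S

# Usage: a scalar AR(1) series, whose long run variance exceeds its variance.
rng = np.random.default_rng(0)
shocks = rng.standard_normal(5000)
x = np.empty(5000)
x[0] = shocks[0]
for t in range(1, 5000):
    x[t] = 0.5 * x[t - 1] + shocks[t]
f = x[:, None]
print(bartlett_hac(f, 1)[0, 0], bartlett_hac(f, 10)[0, 0])
```

With a bandwidth of one the estimator collapses to the sample second moment
matrix, which is the natural choice only when the moments are serially
uncorrelated.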

For the purposes of this discussion, we can confine attention to polynomial
powers of a scalar random variable v_t. For simplicity, assume that
{v_t; t = 1,2,...,T} is an independent sequence and v_t ~ N(0,1). As we have
seen, the GMM estimation strategy exploits the convergence in probability of
sample to population moments, that is


T^{-1} Σ_{t=1}^{T} v_t^k →p E[v_t^k] = μ_k, say.     (6.28)


While this result holds for any k, the variability of the sample moment depends
on k in a rather simple, but striking, fashion. It is straightforward to show
that


Var[T^{-1} Σ_{t=1}^{T} v_t^k] = σ_k²/T

where σ_k² = Var[v_t^k], and under our assumptions it follows that σ_1² = 1,
σ_2² = 2, σ_3² = 15, σ_4² = 96, σ_5² = 945 and so on.⁴¹ So, for example,
T^{-1} Σ_{t=1}^{T} v_t^4 exhibits 96 times as much variability as the sample
mean in any sample size! Or put another way, the variance of the sample mean is
0.1 when T = 10, but it takes a sample of size 960 to achieve the same precision
for T^{-1} Σ_{t=1}^{T} v_t^4. These simple
calculations indicate that the convergence of sample moments is very sensitive
to the form of the nonlinearity. This example is not without practical relevance
either. Polynomial powers naturally occur in the population moment conditions
used in many studies, and these calculations provide a simple intuition behind
the findings in a number of the simulation studies listed in Table 6.2.
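
The variances quoted above follow from the moments of the standard normal
distribution, and the arithmetic can be checked with a few lines of code. The
sketch below uses the fact that E[v^k] is zero for odd k and equals
(k - 1)(k - 3)...3·1 for even k, so that Var[v^k] = E[v^{2k}] - (E[v^k])². The
function names are our own.

```python
from math import prod

def normal_moment(k):
    """E[v^k] for v ~ N(0,1): zero for odd k, (k-1)(k-3)...3*1 for even k."""
    if k % 2 == 1:
        return 0
    return prod(range(k - 1, 0, -2))

def var_of_power(k):
    """sigma_k^2 = Var[v^k] = E[v^(2k)] - (E[v^k])^2."""
    return normal_moment(2 * k) - normal_moment(k) ** 2

variances = {k: var_of_power(k) for k in range(1, 6)}

# Sample size at which T^{-1} sum_t v_t^k has variance 0.1, i.e. the variance
# of the sample mean when T = 10.
required_T = {k: 10 * variances[k] for k in variances}

for k in variances:
    print(k, variances[k], required_T[k])
```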

We now turn our attention to a simulation study of GMM in stochastic 
volatility models which involves moment conditions of the type in (6.28) and 
also HAC estimators. Andersen and Sørensen (1996) consider the following 
simplified version of the model in Section 1.3.5 


y_t = √(x_t) e_t
ln(x_t) = θ_1 + θ_2 ln(x_{t-1}) + θ_3 u_t


40 For the purposes of this discussion, we exclude Ŝ_{VARMA}, which is not considered in any
of the simulation studies listed in Table 6.2.

41 For the standard normal distribution and even k, E[v_t^k] = (k - 1)(k - 3)...3·1; see
Johnson and Kotz (1970) [p.47].
where (e_t, u_t) ~ i.i.d. N(0, I_2).⁴² Note that this version of the model is used both
to generate the data and as the assumed specification for the estimation.
Therefore, there are only three parameters to be estimated: θ = (θ_1, θ_2, θ_3)'. It
can be recalled from Section 1.3.5 that the normality assumption yields an
infinite number of possible population moment conditions. Andersen and Sørensen
(1996) consider estimation based on various permutations of the moment
conditions but for our purposes here it is sufficient to concentrate on just four
choices. Below we just list the moments of y_t involved; the exact form of the
associated population moment condition can then be deduced from (1.48).⁴³ To
this end, we define m_i = E[|y_t|^i], for i = 1,2,3,4; m_i = E[|y_t y_{t-j}|], for
i = 4 + j, j = 1,2,...,10; m_i = E[y_t² y_{t-j}²], for i = 14 + j, j = 1,2,...,10.
The four sets of population moment conditions are then given by:


M5  : m1, m2, m4, m6, m15
M9  : m1 to m5, m7, m9, m16, m18
M14 : m1 to m4, m6, m8, m10, m12, m14, m15, m17, m19, m21, m23
M24 : m1 to m24


Andersen and Sørensen (1996) report results for the two- and three-step
estimators.⁴⁴ They consider different choices of θ_0 in the data generation, and
sample sizes of T = 500, 1000, 2000, 4000 and 10,000. While the latter may seem
large numbers, they are not uncommon in the high frequency data to which
these models are applied. In spite of these sizes, Andersen and Sørensen (1996)
report that their numerical algorithm experienced non-convergence problems in
the smaller sample sizes; further details of the source of these problems and how
they were addressed can be found in their paper.
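
To give a feel for the type of data involved, the following sketch simulates a
log-linear stochastic volatility process and computes a few sample moments of
the kind listed above. It assumes that x_t is the conditional variance, so that
y_t = √(x_t) e_t with ln(x_t) following a Gaussian AR(1). The parameter values
are illustrative choices made for this example and are not taken from the text,
and the function names are our own.

```python
import numpy as np

def simulate_sv(theta, T, burn_in=500, seed=0):
    """Simulate y_t = sqrt(x_t) e_t, ln(x_t) = th1 + th2 ln(x_{t-1}) + th3 u_t."""
    theta1, theta2, theta3 = theta
    rng = np.random.default_rng(seed)
    n = T + burn_in
    e = rng.standard_normal(n)
    u = rng.standard_normal(n)
    log_x = np.empty(n)
    prev = theta1 / (1.0 - theta2)        # start at the stationary mean of ln(x_t)
    for t in range(n):
        prev = theta1 + theta2 * prev + theta3 * u[t]
        log_x[t] = prev
    y = np.exp(0.5 * log_x) * e
    return y[burn_in:]                    # discard the burn-in observations

def sample_moments(y):
    """Sample analogues of m1 to m4, m5 (j = 1) and m15 (j = 1)."""
    a = np.abs(y)
    return {
        "m1": a.mean(),
        "m2": (a ** 2).mean(),
        "m3": (a ** 3).mean(),
        "m4": (a ** 4).mean(),
        "m5": (a[1:] * a[:-1]).mean(),             # |y_t y_{t-1}|
        "m15": (y[1:] ** 2 * y[:-1] ** 2).mean(),  # y_t^2 y_{t-1}^2
    }

theta_illustrative = (-0.736, 0.90, 0.363)   # hypothetical values for this sketch
for T in (500, 10000):
    print(T, sample_moments(simulate_sv(theta_illustrative, T)))
```

Repeating the calculation over many seeds shows how much more variable the
higher order moments are than the lower order ones at a given T, which is the
point made by the polynomial example above.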

Andersen and Sørensen (1996) report results for various choices of kernel
and bandwidth in the HAC estimator. We begin, as they do, with the case in which
a Bartlett kernel is used with b_T = 10. In terms of the first four of our
questions above, the results suggest the following:


1. Bias: There are quite substantial biases at T = 500 but these tend to disappear
quickly as the sample size increases. The bias tends to be smallest with
M9 at T = 2000 and with M14 for the larger samples.


2. C.I.’s: For T > 1000, the empirical coverage is reasonably close to the
nominal value if M9 or M14 are used. However, if M24 is used then the
studentized coefficient, that is (θ̂_{T,i} - θ_{0,i})/s.e.(θ̂_{T,i}), exhibits a
marked leftward skewness even at T = 10,000.


3. J_T: The results reveal an interesting pattern. As the number of moment
conditions increases the distribution of J_T shifts to the right, and, for a
given set of moment conditions, the distribution shifts to the left as T

42 This model can be obtained using the following restrictions in (1.45)—-(1.47): y(tt) = Yt, 
a(t) = xt, dt = 1,7 = 02 —1,a=0, B 1, y = 0, 6 =61, Ç = 03 and p= 0. 

43 Note that in this simple model w_t(θ) = y_t.

44 The “three-step” estimator is the iterated estimator with I_max = 3.
increases. However, there is little evidence that the statistic is converging
to its asymptotic distribution even at these sample sizes. The empirical 
size is closest to its nominal value at T = 1000, T = 2000 and T = 4000 
for respectively M9, M14 and M24. If fewer moment conditions are used 
than this prescription at a given sample size then the test rejects too 
frequently; if too many are used then the test rejects too infrequently. 


4. Iteration: There is no systematic difference between the two- and three- 
step estimators. 


In qualitative terms, these results are broadly similar for various types of
HAC estimators. However, Andersen and Sørensen (1996) report that the
quality of the asymptotic approximation is improved by the use of the prewhitening
and recoloring advocated by Andrews and Monahan (1992), and also the data
based bandwidth selection method proposed by Newey and West (1994). The
evidence also suggests the Bartlett kernel is to be preferred over the quadratic
spectral in this model, which is counter to asymptotic theory.⁴⁵ At the same
time, it should be noted that the evidence suggests the choice between these
HAC estimators is of second order of importance. All the HAC estimators con- 
verge fairly slowly to their limit in this class of models. A similar finding is 
reported by Burnside and Eichenbaum (1996), Christiano and den Haan (1996) 
and West and Wilcox (1996) albeit to differing degrees depending on the setting 
in question. 

Two features of Andersen and Sørensen’s (1996) results stand out. First, a
large sample size is needed for asymptotic theory to approximate finite sample
behaviour in these models, and secondly the quality of the approximation
depends on the choice of moments. The culprit in the first case is Ŝ_T. Andersen
and Sørensen (1996) compare Ŝ_T with its simulated population long run
variance, and find that the former is clearly exhibiting considerable bias and
variation even at T = 10,000. Given that the moment conditions involve polynomial
powers, such behaviour would be anticipated for the reason given above.
However, there may be another facet to this explanation. Altonji and Segal (1996)
argue that in models of covariance structure this slow convergence means that
Ŝ_T^{-1} and T^{1/2} g_T(θ̂_T) exhibit a correlation in finite samples which has a
negative impact on the quality of the asymptotic approximation.⁴⁶ Since stochastic
volatility models involve variances, Altonji and Segal’s (1996) arguments may
well apply here as well. Andersen and Sørensen (1996) also uncover an interesting
explanation for the sensitivity of the quality of the asymptotic approximation
to the choice of moment conditions. They calculate the asymptotic variances of 
the GMM estimator implied by the four choices. These figures reveal a dramatic 
drop in variance with the move from M5 to M9, a smaller, but still marked, 
drop in variance with the move from M9 to M14, but only a slight drop in 


45 In contrast, the studies by Burnside and Eichenbaum (1996) and Christiano and den 
Haan (1996) find no clear ranking between the two kernels is possible. See Section 3.5.3 for 
further discussion of their relative merits. 

46 Recall they are statistically independent in the limit because the former converges in 
probability to S~!, a matrix of constants. 
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variance with the move from M14 to M24. It can be recognized that these cal- 
culations are a good predictor of the finite sample behaviour described above, 
that is the properties of the estimator tended to improve as q increased — at 
least in the larger samples — until we reached M14, but then deteriorated with 
move from M14 to M24. These calculations indicate that whether or not it is 
beneficial to expand the population moment condition in finite samples from 
E|fi(vz,90)| = 0 to El fi (vz, 40)| = 0, El fe(vz,90)] = 0 depends on both the 
precise definitions of E[ (vz, 00)| = 0 and E[f2(vt, 00)]| = 0 and also their inter- 
relationship. So in these terms, it can be recognized that this conclusion echoes 
the one drawn from the calculations based on the Nagar type approximation to 
the bias of the linear IV estimator reported in the previous section. 

Since the move from M14 to M24 has only a marginal impact on the asymp- 
totic variance of the estimator, this expansion of the population moment con- 
dition appears to introduce what might be viewed as nearly redundant moment 
conditions. Therefore, this last set of results appears to suggest that the inclu- 
sion of redundant or nearly redundant moment conditions can lead to a dete- 
rioration in the finite sample properties of the estimator. Such an explanation 
would certainly accord with the intuition gained from the Nagar type approx- 
imation calculated by Imbens (2002) discussed in Section 6.2.2. However, this 
example gives no sense of whether the inclusion of redundant moment conditions 
can have such dramatic effects on the quality of the asymptotic approximation. 
While Andersen and Sørensen (1996) did not explicitly pursue this issue further, 
Hall and Peixe (2003) provide simulation evidence which does corroborate this 
conclusion albeit in a different setting. They consider the following linear model 


$$y_t = x_t \theta_0 + u_t \qquad (6.29)$$
$$x_t = \Pi_0 z_t + e_t \qquad (6.30)$$

where $x_t$ is a scalar and $z_t$ is a 12 x 1 vector. Putting
$v_t = [u_t, e_t, z_t']'$, artificial data are generated using
$v_t \sim IN(0, \Sigma_v)$ where the main diagonal elements of $\Sigma_v$ are
all set to unity, and the only non-zero off-diagonal elements are
$\Sigma_v(1,2) = \Sigma_v(2,1) = cov(u_t, e_t) = 0.5$. The parameters are set to
$\theta_0 = 0$ and $\Pi_0 = [0.5, 0.5, 0, \ldots, 0]$. Notice that within this
design, $[z_{t,3}, z_{t,4}, \ldots, z_{t,12}]$ are redundant given
$(z_{t,1}, z_{t,2})$. Hall and Peixe (2003) consider the behaviour of the set of
IV estimators $\{\hat\theta_T(i); i = 1, 2, \ldots, 12\}$ where
$\hat\theta_T(i)$ is given by (2.8) evaluated at $W_T = (T^{-1} Z'Z)^{-1}$,
$z_t = z_t(i)$ and $z_t(i)$ is the i x 1 vector
$(z_{t,1}, z_{t,2}, \ldots, z_{t,i-1}, z_{t,i})'$.⁴⁷ This definition implies
that, for i > 2, $z_t(i)$ contains i - 2 redundant instruments.
Table 6.3 contains the simulated bias and root mean square error of
$\hat\theta_T(i)$ along with the mean and empirical rejection frequency of the
t-statistic for the hypothesis $H_0: \theta_0 = 0$ based on 10,000 replications
in the case where T = 100. It is evident from these results that the quality of
the asymptotic approximation deteriorates as the number of redundant
instruments increases. For example, if there are up to three redundant
instruments then the empirical rejection frequency of the t-statistic is close
to the nominal value of 10%; however, if there are nine or ten redundant
instruments then the empirical rejection frequency is twice the nominal size.


47 Notice that, for this design, $\hat\theta_T(i)$ is the optimal two step estimator; see Section 2.4.


Table 6.3

Consequences of the inclusion of redundant instruments

   i     bias     rmse    tstat     size
   1   -0.030    0.371    0.084    0.075
   2   -0.000    0.146    0.140    0.095
   3    0.010    0.143    0.211    0.099
   4    0.021    0.142    0.282    0.106
   5    0.030    0.141    0.351    0.114
   6    0.039    0.141    0.418    0.126
   7    0.048    0.142    0.484    0.137
   8    0.057    0.143    0.550    0.148
   9    0.065    0.144    0.617    0.161
  10    0.073    0.147    0.682    0.177
  11    0.080    0.149    0.746    0.195
  12    0.088    0.152    0.810    0.210


Source: Hall and Peixe (2003). Copyright Marcel Dekker; reprinted with permission.

Notes: i is the number of instruments used. bias and rmse are the simulated bias and rmse
of $\hat\theta_T(i)$. tstat denotes the simulated mean of the t-statistic for
$H_0: \theta_0 = 0$. size denotes the empirical size of the t-test with nominal size 0.1.
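
The design behind Table 6.3 is simple enough to replicate in outline. The
sketch below is an illustration under stated assumptions rather than Hall and
Peixe's own code: the function name `redundant_iv_experiment`, the seed, the
reduced number of replications and the homoskedastic standard error used for
the t-statistic are all choices made here, so the figures it produces need not
match the table exactly.

```python
import numpy as np

def redundant_iv_experiment(n_instr, T=100, reps=2000, seed=0):
    """Monte Carlo for the IV estimator using the first n_instr of 12 instruments."""
    rng = np.random.default_rng(seed)
    pi0 = np.zeros(12)
    pi0[:2] = 0.5                      # only the first two instruments are relevant
    crit = 1.6449                      # two-sided 10% critical value from N(0, 1)
    theta_hat = np.empty(reps)
    t_stat = np.empty(reps)
    for r in range(reps):
        z = rng.standard_normal((T, 12))
        e = rng.standard_normal(T)
        # u has unit variance, cov(u, e) = 0.5, and is independent of z
        u = 0.5 * e + np.sqrt(0.75) * rng.standard_normal(T)
        x = z @ pi0 + e
        y = u                          # theta_0 = 0, so y_t = u_t
        Z = z[:, :n_instr]
        x_fit = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # projection of x on Z
        theta = (x_fit @ y) / (x_fit @ x)
        resid = y - x * theta
        # assumed homoskedastic variance formula for the t-statistic
        se = np.sqrt((resid @ resid / T) / (x_fit @ x_fit))
        theta_hat[r] = theta
        t_stat[r] = theta / se
    return {
        "bias": theta_hat.mean(),                    # theta_0 = 0
        "rmse": np.sqrt(np.mean(theta_hat ** 2)),
        "tstat": t_stat.mean(),
        "size": np.mean(np.abs(t_stat) > crit),
    }

few = redundant_iv_experiment(2)     # no redundant instruments
many = redundant_iv_experiment(12)   # ten redundant instruments
```

Comparing `few` with `many` should reproduce the qualitative pattern in the
table: the bias and the empirical size of the t-test both rise as redundant
instruments are added.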


While we have reviewed only four studies in detail, their results are repre- 
sentative of this literature. In terms of our five questions, the overall findings 
for the first four are as follows. 


1. Bias: the estimator is approximately unbiased in some settings and not
in others. The bias tends to increase with q - p, the degree of
overidentification, and particularly with the inclusion of a number of
redundant moment conditions. However, a low value for q - p is not a guarantee
of the absence of bias because the bias is also sensitive to other aspects
of the model such as the functional form of the moment condition, the time
series properties of the data and the choice of long run covariance matrix
estimator.


2. C.I.'s: the empirical coverage of the asymptotic confidence intervals is
sometimes close to the nominal value but more often tends to be less than
the nominal value. This means the asymptotic confidence intervals tend to
overstate the precision of the estimation in finite samples. The empirical
coverage tends to deteriorate with the inclusion of a number of redundant
moment conditions or in the presence of weak identification. The reliability
of the asymptotic approximation is also sensitive to the time series
properties of the data and the functional form of the moment condition; the
approximation can be extremely unreliable in circumstances where these
two features of the model interact to cause the long run covariance matrix
estimator to be ill-behaved.


3. $J_T$: in some cases it is well approximated by a $\chi^2_{q-p}$
distribution but in others this approximation can be poor. In the latter
cases, the test may either reject or fail to reject too frequently depending
on the model in question. While there does not appear to be a systematic
pattern to the relationship between the empirical and nominal size, the
discrepancy between them appears to be larger in the presence of weak
identification. The reliability of the asymptotic approximation is also
sensitive to the time series properties of the data and the functional form of
the moment condition; the approximation can be extremely unreliable in
circumstances where these two features of the model interact to cause the long
run covariance matrix estimator to be ill-behaved.


4. Iteration: the quality of asymptotic approximation tends to be greatly im- 
proved by iteration. 


Since only one study has examined the continuous updating estimator to date, 
our conclusions about the fifth question are more tentative. Nevertheless, for 
completeness, we summarize them here. 


5. Continuous updating: this version of the estimator tends to exhibit fat tails 
which may have undesirable consequences for parameter estimation, but 
the associated overidentifying restrictions test may be closer to its asymp- 
totic distribution than its counterpart based on the iterated estimator. 


6.4 Summary and Link to Following Chapters 


In this chapter, we have investigated how well asymptotic theory approximates 
behaviour in samples of the size encountered in practice. It seems fair to say 
that the evidence is mixed. In some models of interest the approximation can be 
good at samples of size 100, and in others it is bad even at samples a hundred 
times larger. Furthermore, for a given functional form, the adequacy of the 
approximation may be very sensitive to the parameter values used to generate 
the artificial data. Perhaps only one thing can be said for certain, that is finite 
sample behaviour is far more complex than would be predicted by asymptotic 
theory. 

In spite of this complexity, the following factors appear to play an important 
role in determining the quality of the asymptotic approximation: 


• the functional form of $f(v_t, \theta_0)$;

• the degree of overidentification, q - p;

• the interrelationship between the elements of $f(v_t, \theta_0)$;

• the quality of the identification;

• the estimator of the long run variance.


All these factors collectively point to the following conclusion: the exact
choice of population moment condition is crucial to the performance of the
method. This observation motivates the material covered in the next chapter.
Two specific questions are addressed: is there a feasible optimal choice of
population moment condition? and how can we select the right set of population
moment conditions for the problem in hand? Progress has been made with both
questions, but at the end of the day, there is still a need to explore methods
for improving the properties of inference techniques in finite samples. In
Chapter 8, we examine three methods for achieving this goal. The first is the
use of the bootstrap to provide more accurate critical points, the second is an
asymptotic theory which has been developed for the case in which some or all of
the parameters are weakly identified, and the third is an asymptotic theory in
which the HAC estimator converges to a random matrix.


7 


Moment Selection in 
Theory and in Practice 


A researcher is typically faced with a large set of alternatives from which to 
choose the q elements of the population moment condition. This choice can 
be made in an ad hoc fashion, but it is clearly preferable to base selection of 
f(.) upon statistical criteria which reflect the ultimate purpose of the analysis. 
Throughout this chapter, we focus almost exclusively on the common case in 
which the objective is to make inferences about $\theta_0$ based on the asymptotic
distribution theory developed in previous chapters. From this perspective, the 
optimal choice of moment condition is the score vector because the resulting 
GMM estimator is the MLE and the latter is known to be asymptotically efficient
in the class of consistent uniformly asymptotically normal estimators.¹
Unfortunately, as argued in Section 1.1, ML is infeasible in the types of model 
listed in Table 1.1. Therefore, if any useful guidance is to be provided for these 
settings then optimality must be judged relative to the class of moment con- 
ditions employed in practice for the model under consideration. There have 
been two distinct phases to the literature on moment selection within the GMM 
framework. From the mid-1980s until the mid-1990s, attention focused on the 
use of theoretical arguments to characterize the optimal choice of moment con- 
dition within the class of GMM estimators known as Generalized Instrumental 
Variables. More recently, attention has focused on data based methods for 
moment selection using information criteria. Both phases of the literature are 
reviewed in this chapter. 

To begin this discussion, it is useful to consider what properties it is desir- 
able for the selected moment condition to possess. Using the material from the 
previous chapters, it is argued in Section 7.1 that the selected moment condi- 
tion should satisfy three conditions: the orthogonality condition, the efficiency 
condition and the non-redundancy condition. The latter two are most naturally
considered together and their combination is referred to as the relevance
condition. In addition, this section describes both the ways in which moment
selection complicates the concept of identification, and also how the use of the
data in moment selection has the potential to contaminate subsequent inferences
about $\theta_0$.


1 See Section 3.8.

Section 7.2 reviews the available results on the efficient choice of moment 
condition within Generalized Instrumental Variables (GIV) estimation. Within 
this class of problems, the optimal choice of moment condition is found by char- 
acterizing the optimal choice of instrument vector. The choice of instrument 
vector involves two decisions: which elements of the information set should be 
used? and, which functions of these elements (or variables) should be used as 
instruments? Since the first question depends on the particular model under 
consideration, the answer varies from case to case. Therefore, with few ex- 
ceptions, the literature on optimal instruments has focused exclusively on the 
second question. It turns out that the optimal functional form is relatively 
straightforward to characterize in static models, but far more problematic in 
dynamic models. The relative simplicity of the solution in static models makes 
it far easier to develop an intuition for the form of the optimal instrument in 
this context. It is therefore instructive to start by considering the static case, 
and then to use these results as a stepping stone to the dynamic models which 
are the focus of this book. Accordingly, we split our discussion into two parts 
with Section 7.2.1 covering the static case and Section 7.2.2 considering the 
extension to the dynamic case. In either case, the optimal functional form de- 
pends on aspects of the data generation process which are typically unknown in 
practice. One possible way forward is to estimate these unknown features of the 
data generation process from the sample, and then substitute these estimates 
into the formula for the optimal instrument. However, these auxiliary estimations
encounter a number of practical problems which are also described in
Section 7.2. In fact, these problems tend to be of sufficient magnitude that the 
“optimal instrument” is rarely used in applications. Therefore, this literature is 
best viewed as providing an efficiency bound for GIV estimators rather than a 
practical method for instrument selection. This bound can be used to compare 
the efficiency of GIV with other estimators, and Section 7.2.3 provides a brief 
review of the available results of this type. 

In contrast to the setting just described, a researcher must decide which 
moments to choose without knowledge of the underlying data generation pro- 
cess. In such circumstances, moment selection must perforce be based upon 
the data, and this is a key feature of all the methods recently proposed in the 
literature. These methods are reviewed in Section 7.3. Section 7.3.1 deals with 
selection based on the orthogonality condition and Section 7.3.2 deals with se- 
lection based on the relevance condition. Section 7.3.3 discusses their sequential 
use to provide a practical method for moment selection and illustrates it us- 
ing Hansen and Singleton’s (1982) consumption based asset pricing model. The 
methods reviewed in Sections 7.3.1 to 7.3.3 can be applied to any GMM estimator
that satisfies the types of regularity condition in Chapter 3. There has also been 
some related work within the more restrictive setting of GIV estimation. These 
methods are briefly reviewed in Section 7.3.4.


7.1 Preliminaries


To consider the problem of moment selection, it is necessary to introduce some
additional notation. It is assumed that the candidate set of scalar functions
which can form the basis for the population moment condition is finite. It is
convenient to stack these scalar functions into a single vector $f_{max}(.)$
whose dimension is denoted by $q_{max}$. Following Andrews (1999), we use a
($q_{max}$ x 1) selection vector c to denote which elements of the candidate
set are included in a particular moment condition. We therefore now index f(.)
by c; $c_j = 1$ implies the $j^{th}$ element of $f_{max}(.)$ is included in
f(.;c), and $c_j = 0$ implies this element is excluded. Note that |c| = c'c
equals the number of elements in f(.;c). The set of all possible selection
vectors is denoted by C, that is


$$C = \{ c \in \Re^{q_{max}};\ c_j = 0, 1,\ \text{for } j = 1, 2, \ldots, q_{max};\ \text{and } c = (c_1, c_2, \ldots, c_{q_{max}})',\ |c| \geq p \}$$


Below we use $c_{sel}$ to denote the element of C that indexes the "selected"
moment condition. For the present, we need not concern ourselves with how this
element is selected.
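
The selection vector notation maps directly onto code. The following is a
small sketch, assuming the restriction in the definition of C is that at least
p moments are selected; the helper names `selection_vectors` and
`select_moments` are hypothetical and introduced only for illustration.

```python
from itertools import product

def selection_vectors(q_max, p):
    """Return every 0/1 selection vector of length q_max with at least p ones."""
    return [c for c in product((0, 1), repeat=q_max) if sum(c) >= p]

def select_moments(f_max, c):
    """Pick the elements of the stacked candidate vector f_max flagged by c."""
    return [f_j for f_j, c_j in zip(f_max, c) if c_j == 1]

# Three candidate moments and two parameters: |c| must be 2 or 3
C = selection_vectors(q_max=3, p=2)
chosen = select_moments(["f1", "f2", "f3"], (1, 0, 1))
```

The number of elements of C grows exponentially in the size of the candidate
set, which is one reason why exhaustive search over C is only practical when
the candidate set is small.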

To assess what properties are desirable for the selected moment condition,
it is necessary to consider the objective of the estimation. Throughout this
chapter, we follow the empirical literature and assume that this objective is
to make inferences about $\theta_0$ based on the two step (or iterated)
estimator using the GMM asymptotic distribution theory developed in previous
chapters.² For purposes of discussion, it is useful to restate the appropriate
version of Theorem 3.2 in terms of the notation used here. Accordingly, define
$\hat\theta_T(c)$ to be the GMM estimator based on
$E[f(v_t, \theta_0; c)] = 0$, and let $V_0(c)$ denote the matrix
$[G_0(c)' S(c)^{-1} G_0(c)]^{-1}$ where
$G_0(c) = E[\partial f(v_t, \theta_0; c)/\partial\theta']$ and
$S(c) = \lim_{T \to \infty} Var[T^{-1/2} \sum_{t=1}^{T} f(v_t, \theta_0; c)]$.
The distributional result in Theorem 3.2 can then be restated as


$$T^{1/2}[\hat\theta_T(c) - \theta_0] \xrightarrow{d} N(0, V_0(c)) \qquad (7.1)$$


Our list of three desirable properties for the selected moment condition arises
from a consideration of the first and second moment properties of this
asymptotic distribution, and of its quality as an approximation to finite
sample behaviour.

The distribution in (7.1) has a mean of zero, and so embodies the assumption
that the GMM estimator is consistent for the true value $\theta_0$. From
Section 3.4, it is clear that Assumption 3.3 plays a crucial role in the
derivation of this result, and so it is desirable for this condition to be
satisfied by the selected vector. This observation leads to the following
condition.


2 It should be noted that while this objective is common to many of the studies in Table
1.1, it is not shared by all. For example, in some cases the main focus of the study is a point
estimate of a particular parameter and this may necessitate an alternative criterion for moment
selection; see Section 7.3.4 for further discussion.


Definition 7.1 Orthogonality Condition
The selected moment condition satisfies Assumption 3.3, that is
$E[f(v_t, \theta_0; c_{sel})] = 0$.


If this condition is satisfied, then the asymptotic distribution can be viewed as 
having the desirable first moment properties. 

In most cases, there is more than one element of C which yields a moment 
condition satisfying the orthogonality condition. It is clearly most desirable to 
base inference on the moment condition with the smallest variance in a matrix 
sense. This leads to the following efficiency condition. 


Definition 7.2 Efficiency Condition 
The selected moment condition is efficient, that is $V_0(c) - V_0(c_{sel})$ is
positive semi-definite for all $c \in C$ such that $E[f(v_t, \theta_0; c)] = 0$.


If this condition is satisfied, then the asymptotic distribution can be viewed as 
having desirable second moment properties. 

It can be recalled from Theorem 6.1 that the asymptotic variance can never
increase as q increases. Therefore, the efficiency condition can be met by basing
the estimation upon the moment condition consisting of all elements of the
candidate set that satisfy the orthogonality condition. However, simulation
evidence indicates that the inclusion of redundant moment conditions can lead
to a deterioration in the quality of the asymptotic approximation to finite sample
behaviour.³ This consideration motivates the non-redundancy condition.


Definition 7.3 Non-Redundancy Condition 
No individual element of $E[f(v_t, \theta_0; c_{sel})] = 0$ is redundant given
the remaining elements.


It should be noted that, to date, there are no theoretical results on the deter- 
minants of the quality of the asymptotic approximation in nonlinear dynamic 
models that might provide a basis for selecting moments so that this approxi- 
mation is good. Selection based on non-redundancy is best viewed, therefore, 
as a way of avoiding a situation in which the quality of the approximation can 
be very bad. 

Both the efficiency and non-redundancy conditions relate to the asymptotic 
variance of the estimator, and so it proves useful to treat them simultaneously 
on occasion in moment selection. For expositional brevity, we refer to this 
combination as the relevance condition. 


Definition 7.4 Relevance Condition 
The selected moment condition is said to be relevant for the estimation of $\theta_0$ if
it satisfies both the efficiency and non-redundancy conditions. 
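
A small numerical sketch may help fix the efficiency and non-redundancy
conditions. The matrices `G_max` and `S_max` below are hypothetical values of
the derivative and long run variance matrices for a candidate set with three
moments and one parameter, and the helper names `avar` and `is_psd` are
introduced only for this illustration. The code evaluates
$V_0(c) = [G_0(c)' S(c)^{-1} G_0(c)]^{-1}$ for three selection vectors.

```python
import numpy as np

def avar(G_max, S_max, c):
    """V_0(c) = [G_0(c)' S(c)^{-1} G_0(c)]^{-1} for the moments selected by c."""
    idx = np.flatnonzero(np.asarray(c) == 1)
    G = G_max[idx, :]                    # rows of the derivative matrix kept by c
    S = S_max[np.ix_(idx, idx)]          # matching block of the long run variance
    return np.linalg.inv(G.T @ np.linalg.solve(S, G))

def is_psd(M, tol=1e-10):
    """Check positive semi-definiteness via the eigenvalues of the symmetric part."""
    return bool(np.all(np.linalg.eigvalsh((M + M.T) / 2) >= -tol))

# Hypothetical candidate set: q_max = 3 moments, p = 1 parameter
G_max = np.array([[1.0], [1.0], [0.0]])
S_max = np.eye(3)

v_one = avar(G_max, S_max, (1, 0, 0))    # first moment only
v_two = avar(G_max, S_max, (1, 1, 0))    # adding the second moment
v_all = avar(G_max, S_max, (1, 1, 1))    # adding the third moment as well

gain_from_second = is_psd(v_one - v_two)        # variance does not increase
third_is_redundant = np.allclose(v_two, v_all)  # no further reduction
```

In this constructed example the second moment halves the asymptotic variance,
whereas the third leaves it unchanged and so is redundant given the first two;
the selection (1, 1, 0) would therefore satisfy the relevance condition.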


The remainder of this chapter focuses on methods for moment selection based 
on the conditions above. Section 7.2 reviews the literature on the
characterization of the choice of moment condition that satisfies the
efficiency condition in a class of Generalized Instrumental Variables
estimators. Sections 7.3.1 and 7.3.2 describe methods for moment selection
based on the orthogonality condition and relevance condition respectively, and
Section 7.3.3 considers their combined use in a sequential fashion.


3 See Section 6.3.

It might be wondered why the identification condition (Assumption 3.4) did
not enter the discussion, particularly since this assumption played a crucial
role in the analysis in Chapter 3. The reason is that the issue of identification
becomes much more complex once allowance is made for moment selection. In
Chapters 3 and 4, all the analysis is conditional on a given choice of f(.). For
example, the model is said to be correctly specified if there exists a value
$\theta_0$ such that $E[f(v_t, \theta_0)] = 0$, and $\theta_0$ is said to be
identified if there is no other value of $\theta$ which satisfies this moment
condition. The setting here is different because, by its very nature, moment
selection means we must consider different choices of f(.). Now it is entirely
possible for two different choices of moment condition to satisfy the
orthogonality condition at different parameter values, that is both
$E[f(v_t, \theta_1; c_1)] = 0$ and $E[f(v_t, \theta_2; c_2)] = 0$ with
$\theta_1 \neq \theta_2$. As seen below, each of the proposed moment selection
methods involves its own particular assumption about identification that must
hold if the method is to have the desired properties.

We conclude this section by considering a further way in which moment
selection complicates the analysis. The methods described in Section 7.3 are
based on the data. However, once the data are employed in this way, a potential
problem emerges. All the asymptotic theory developed in Chapters 3 and 5 is
premised on the assumption that f(.) is fixed a priori. If f(.) is selected
from the data then the choice of moment condition may be random, and hence
the asymptotic properties of the resulting estimator would depend on the
statistical properties of the selection method. From a practical perspective,
it is simpler by far if we can proceed with our inference about $\theta_0$ as
if the selected moment condition had been fixed a priori. We refer to this
requirement as the inference condition. This issue has been addressed in the
literature by providing conditions under which the data based selection vector,
$\hat{c}_T$ say, converges in probability to a constant vector because in this
case the validity of the inference condition can be deduced from the following
lemma due to Pötscher (1991).


Lemma 7.1 Sufficient Conditions for the Inference Condition
Let $\hat{c}_T, c_0 \in C$ and let $h_T(c)$ be any statistic based on
$E[f(v_t, \theta_0; c)] = 0$. If $\hat{c}_T \xrightarrow{p} c_0$ then
$h_T(\hat{c}_T) - h_T(c_0) = o_p(1)$.


If $h_T(\hat{c}_T) - h_T(c_0) = o_p(1)$ then $h_T(\hat{c}_T)$ has the same
asymptotic properties as $h_T(c_0)$. This lemma provides a theoretical
justification for proceeding with inference as if c has been set equal to
$c_0$ a priori but, as Pötscher (1991) observes, this result must be
interpreted with some caution. The convergence is only pointwise, and Pötscher
(1991) shows that the convergence may not be uniform in some cases of interest.
As a result, the asymptotic distribution of $h_T(c_0)$ may provide a very poor
approximation to the distribution of $h_T(\hat{c}_T)$ even in large samples.
Pötscher (1991) demonstrates this lack of uniform convergence in an example
where the dimension of the parameter vector is related to the dimension of the
model. To date, it is unclear whether these arguments translate to the setting
here in which the dimension of the parameter vector is independent of the
moment selection. In the absence of a theoretical resolution, the only guidance
available is from simulation studies and these studies are reviewed on a case
by case basis below. Finally, it is useful to introduce an item of terminology.
It is customary in the model selection literature to say that $\hat{c}_T$ is
consistent for $c_0$ to describe the situation in which
$\hat{c}_T \xrightarrow{p} c_0$, and we follow this practice below.


7.2 The Optimal Instrument 


In this section, we restrict attention to a class of GMM estimators known as 
Generalized Instrumental Variables (GIV) for which the efficient choice of mo- 
ment condition is characterized by finding the efficient choice of instrument. It 
is customary to refer to this efficient choice as the “optimal instrument” for rea- 
sons that become apparent, and we follow this practice. Although our running 
empirical illustration is actually an example of GIV, we have not yet discussed 
this particular class of GMM estimators in its general form.⁴ Since the structure
of the associated population moment condition is crucial for our analysis here, 
we begin by providing a formal definition of the GIV estimator. 

Within the GIV framework, the population moment condition is based on
the statistical orthogonality of two vectors. These two vectors are denoted here
by $u_t(\theta_0)$ and $z_{t-m}$. The vector $u_t(\theta_0)$ consists of
functions of the data and the unknown parameter vector, and satisfies the
conditional moment restriction


$$E[u_t(\theta_0) \,|\, \Omega_{t-m}] = 0 \qquad (7.2)$$


where $\Omega_{t-m}$ is the information set at time t - m for some non-negative
integer m. The exact definitions of $\Omega_{t-m}$ and m depend on the
assumptions about the dynamic structure, and so are provided below on a case by
case basis. In applications, (7.2) represents the information derived from the
underlying economic/statistical model. The instrument vector $z_{t-m}$ consists
of a vector of functions of elements of the information set, and so satisfies


$$z_{t-m} \in \Omega_{t-m} \qquad (7.3)$$


Using an iterated expectations argument,⁵ equations (7.2) and (7.3) can be
combined to deduce the population moment condition,


$$E[z_{t-m} \otimes u_t(\theta_0)] = 0 \qquad (7.4)$$


Hansen and Singleton (1982) refer to GMM estimation based on (7.4) as
Generalized Instrumental Variables estimation. In view of the genesis of (7.4),
the researcher needs only to decide which $z_{t-m}$ to use in order to
implement GIV.


4 GIV is also used to estimate the conditional capital asset pricing model (Section 1.3.3)
and the inventory holdings model (Section 1.3.4).
5 See Section 1.3.1.


Therefore, within this framework, the problem of moment selection reduces to
one of instrument selection.
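
The iterated expectations step that links the conditional moment restriction
to the unconditional one can be written out explicitly. The display below is a
sketch of the standard argument in the notation of (7.2) to (7.4) and adds
nothing beyond those three equations.

```latex
\begin{align*}
E[z_{t-m} \otimes u_t(\theta_0)]
  &= E\{\, E[z_{t-m} \otimes u_t(\theta_0) \mid \Omega_{t-m}] \,\} \\
  &= E\{\, z_{t-m} \otimes E[u_t(\theta_0) \mid \Omega_{t-m}] \,\}
     && \text{since } z_{t-m} \in \Omega_{t-m} \text{ by (7.3)} \\
  &= E\{\, z_{t-m} \otimes 0 \,\} = 0
     && \text{by (7.2)}
\end{align*}
```

The same argument goes through for any function of the information set, which
is why the choice of instrument is left entirely to the researcher.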

In the literature on optimal instruments, it is customary to work with a
slightly modified version of the population moment condition.⁶ Instead of (7.4),
the population moment condition takes the form


$$E[f(v_t, \theta_0)] = E[Z_{t-m} u_t(\theta_0)] = 0 \qquad (7.5)$$


where $u_t(\theta_0)$ is a (s x 1) vector of functions which satisfies (7.2),
$Z_{t-m}$ is a (q x s) matrix and $Z_{t-m} \in \Omega_{t-m}$. If GMM estimation
is based on the population moment condition in (7.5) with the optimal choice of
weighting matrix then it follows from Theorems 3.2 and 3.4 that


$$T^{1/2}(\hat\theta_T - \theta_0) \xrightarrow{d} N(0, V(Z)) \qquad (7.6)$$

where

$$V(Z) = \{ E[\partial u_t(\theta_0)'/\partial\theta \; Z_{t-m}'] \, S_Z^{-1} \, E[Z_{t-m} \, \partial u_t(\theta_0)/\partial\theta'] \}^{-1} \qquad (7.7)$$


for $S_Z = \lim_{T \to \infty} Var[T^{-1/2} \sum_{t=1}^{T} Z_{t-m} u_t(\theta_0)]$.
Since this distribution is centred on zero by construction, the optimal choice
of $Z_{t-m}$ is the one which minimizes V(Z) in a matrix sense.

Below we use the notation $Z^o_{t-m}$ to denote the optimal instrument. Since
this optimality is relative to the class of instruments which lead to an
asymptotic distribution of the form (7.6)-(7.7), it is necessary that the
optimal instrument satisfies the regularity conditions for Theorem 3.2. It is
most convenient to impose these regularity conditions up front. Since our focus
here is on the functional form of the optimal instrument, we adopt the
following high level assumption.⁷


Assumption 7.1 Regularity Conditions for the Optimal Instrument
$f(v_t, \theta_0) = Z^o_{t-m} u_t(\theta_0)$ satisfies the regularity conditions for Theorem 3.2.


7.2.1 Static Models 


For this part of our discussion, ergodicity is replaced by the following more 
restrictive assumption. 


Assumption 7.2 Independence 
$\{v_t; t = 1, 2, \ldots, T\}$ forms an independent sequence.


Notice that Assumptions 3.1 and 7.2 together imply $\{v_t\}$ forms an independent
and identically distributed process. 

6 This difference facilitates the analysis but makes no difference to the ultimate result.
7 More primitive conditions can be found in either Gallant (1987), Newey (1993) (for the
iid case) or Wooldridge (1994).


To proceed further, it is necessary to put some structure on the information
set which appears in (7.2). Throughout this book, $\{v_t\}$ is taken to be a
time series. Assumption 7.2 implies that $v_t$ is independent of the history of
the process, $V_{t-1} = (v_{t-1}, v_{t-2}, \ldots)$. Furthermore, by
construction $V_{t-1}$ is observable at time t and so must lie in the
information set. However, the information set must contain more than this if
GMM estimation is to work here. To see why, consider the static linear model in
Chapter 2 and suppose that $x_{t-1}$ is used as an instrument for $x_t$. In
this case, it follows from Assumption 2.3 that the condition for identification
is $rank\{E[x_{t-1} x_t']\} = p$. However, if $v_t$, and hence $x_t$, is i.i.d.
then


$$E[x_{t-1} x_t'] = E[x_{t-1}] E[x_t'] = \mu_x \mu_x', \text{ say}$$


which is rank one by construction. To avoid this problem, it is necessary for
the information set to contain some contemporaneous variables. Therefore, we
partition $v_t$ into $v_t = (v_{1,t}', v_{2,t}')'$ and define the information
set to be $\Omega_t = \{v_{2,t}, V_{t-1}\}$. This structure also means that
expectations conditional on $\Omega_t$ are identical to those conditional on
$v_{2,t}$ and so we use the notation $E[\,\cdot\,|\,v_{2,t}]$ for
$E[\,\cdot\,|\,\Omega_t]$ below.
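
The rank deficiency that arises when a lagged regressor is used as an
instrument under independence is easy to verify numerically. The sketch below
uses a hypothetical three-dimensional regressor with an assumed mean vector
`mu` and identity covariance; none of these values come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0, 3.0])           # hypothetical mean of x_t, so p = 3
T = 200_000
x = mu + rng.standard_normal((T, 3))     # i.i.d. draws of x_t

# Population cross moment E[x_{t-1} x_t'] = mu mu' under independence
population = np.outer(mu, mu)
population_rank = np.linalg.matrix_rank(population)

# Sample analogue (T-1)^{-1} sum_t x_{t-1} x_t'
sample = x[:-1].T @ x[1:] / (T - 1)
singular_values = np.linalg.svd(sample, compute_uv=False)
```

Only the first singular value of the sample cross moment is far from zero, so
with p = 3 the rank condition for identification fails even though each lagged
regressor is a valid instrument in the orthogonality sense.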

The optimal choice of $Z_t$ is given by the following theorem.


Theorem 7.1 The Optimal Choice of Instrument in Static Models
If (i) $v_t$ satisfies Assumptions 3.1 and 7.2; (ii) Assumption 7.1 holds with
m = 0; then the optimal choice of $Z_t$ in (7.5) is given by

$$Z_t^o = K \, E[\partial u_t(\theta_0)'/\partial\theta \,|\, v_{2,t}] \, \Sigma_{u|v_2}^{-1}$$

where K is any (p x p) nonsingular matrix of finite constants and
$\Sigma_{u|v_2} = E[u_t(\theta_0) u_t(\theta_0)' \,|\, v_{2,t}]$. This optimal
choice leads to a GMM estimator with asymptotic covariance matrix


$$V(Z^o) = \{ E[\, E[\partial u_t(\theta_0)'/\partial\theta \,|\, v_{2,t}] \, \Sigma_{u|v_2}^{-1} \, E[\partial u_t(\theta_0)/\partial\theta' \,|\, v_{2,t}] \,] \}^{-1}$$
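
To see what the theorem delivers in a concrete case, consider a scalar linear
model $y_t = x_t \theta_0 + u_t$ with a single conditioning variable $z_t$
playing the role of $v_{2,t}$. The sketch below assumes, purely for
illustration, that $E[x_t | z_t] = z_t$ and $Var(u_t | z_t) = 0.5 + z_t^2$.
Since $\partial u_t/\partial\theta = -x_t$, the optimal instrument is
proportional to $E[x_t | z_t]/Var(u_t | z_t)$ (taking K = -1). With one
instrument, the asymptotic variance in (7.7) reduces to
$E[g^2 \sigma^2]/(E[g \pi])^2$ for an instrument $g(z_t)$, and expectations are
approximated here by averages over simulated draws.

```python
import numpy as np

def avar_scalar_iv(g, pi, sigma2):
    """Asymptotic variance E[g^2 sigma^2] / (E[g pi])^2 of the IV estimator
    using the scalar instrument g(z_t); expectations are sample averages."""
    return np.mean(g ** 2 * sigma2) / np.mean(g * pi) ** 2

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)     # draws of the conditioning variable
pi = z                               # assumed E[x_t | z_t]
sigma2 = 0.5 + z ** 2                # assumed Var(u_t | z_t)

v_raw = avar_scalar_iv(z, pi, sigma2)              # instrument z_t itself
v_opt = avar_scalar_iv(pi / sigma2, pi, sigma2)    # instrument from the theorem
v_bound = 1.0 / np.mean(pi ** 2 / sigma2)          # V(Z^o) from the theorem
```

Here `v_opt` coincides with `v_bound` and is no larger than `v_raw`, which is
the Cauchy-Schwarz inequality at work: weighting the conditional mean of the
regressor by the inverse conditional variance of the error cannot do worse than
any other function of $z_t$.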


Proof:
Let $\hat\theta_T(Z)$ denote the GIV estimator based on (7.5) with the optimal
weighting matrix, and $\hat\theta_T(Z^o)$ denote the GIV estimator based on
(7.5) with $Z_t = Z_t^o$. Notice that $Z_t^o$ is (p x s) and so the choice of
weighting matrix is immaterial in this case.

The proof rests on using


$$\hat\theta_T(Z) = \hat\theta_T(Z^o) + [\hat\theta_T(Z) - \hat\theta_T(Z^o)] \qquad (7.8)$$


to derive an explicit formula for $V(Z) - V(Z^o) = D(Z)$. It is then shown that
D(Z) is positive semi-definite for any choice of Z, which establishes the
desired result.

The matrix $D(Z)$ depends on certain asymptotic variances and covariances.
It is most convenient to define these terms prior to the derivation. These definitions
rest on the random vectors which determine the asymptotic distributions
of $\hat\theta_T(Z)$ and $\hat\theta_T(Z^o)$. From (3.26) it follows that

$$T^{1/2}[\hat\theta_T(Z) - \theta_0] = T^{-1/2}\sum_{t=1}^{T} m_t(Z) + o_p(1) \tag{7.9}$$

where

$$m_t(Z) = -[F_Z(\theta_0)'F_Z(\theta_0)]^{-1}F_Z(\theta_0)'S_Z^{-1/2}Z_t u_t(\theta_0) \tag{7.10}$$

and $F_Z(\theta_0) = S_Z^{-1/2}E[Z_t\,\partial u_t(\theta_0)/\partial\theta']$. Similarly, the corresponding expression
for the optimal GIV estimator is given by

$$T^{1/2}[\hat\theta_T(Z^o) - \theta_0] = T^{-1/2}\sum_{t=1}^{T} m_t(Z^o) + o_p(1) \tag{7.11}$$

where

$$m_t(Z^o) = -\{E[Z_t^o\,\partial u_t(\theta_0)/\partial\theta']\}^{-1}Z_t^o u_t(\theta_0) \tag{7.12}$$

and we have set $K = I_p$ without loss of generality. The derivation of $D(Z)$ is
most readily understood if we adopt a notation which explicitly reflects the
variance/covariance nature of the terms. Accordingly, we introduce the following
definitions

$$Avar[\hat\theta_T(Z)] = \lim_{T\to\infty} Var\Big[T^{-1/2}\sum_{t=1}^{T} m_t(Z)\Big]$$

$$Avar[\hat\theta_T(Z^o)] = \lim_{T\to\infty} Var\Big[T^{-1/2}\sum_{t=1}^{T} m_t(Z^o)\Big]$$

$$Acov[\hat\theta_T(Z), \hat\theta_T(Z^o)] = \lim_{T\to\infty} E\Big[\Big\{T^{-1/2}\sum_{t=1}^{T} m_t(Z)\Big\}\Big\{T^{-1/2}\sum_{t=1}^{T} m_t(Z^o)\Big\}'\Big]$$

$$Avar[\hat\theta_T(Z) - \hat\theta_T(Z^o)] = \lim_{T\to\infty} Var\Big[T^{-1/2}\sum_{t=1}^{T} d_t(Z)\Big]$$

$$Acov[\hat\theta_T(Z^o), \hat\theta_T(Z) - \hat\theta_T(Z^o)] = \lim_{T\to\infty} E\Big[\Big\{T^{-1/2}\sum_{t=1}^{T} m_t(Z^o)\Big\}\Big\{T^{-1/2}\sum_{t=1}^{T} d_t(Z)\Big\}'\Big] = C, \text{ say}$$

where $d_t(Z) = m_t(Z) - m_t(Z^o)$ and the "A" prefix stands for asymptotic.
Notice that $Avar[\hat\theta_T(Z)]$ and $Avar[\hat\theta_T(Z^o)]$ are just the matrices $V(Z)$ and
$V(Z^o)$ given in (7.7) and Theorem 7.1.

We are now in a position to derive $D(Z)$. From (7.8), it follows that

$$T^{1/2}[\hat\theta_T(Z) - \theta_0] = T^{1/2}[\hat\theta_T(Z^o) - \theta_0] + T^{1/2}[\hat\theta_T(Z) - \hat\theta_T(Z^o)]$$

and so

$$Avar[\hat\theta_T(Z)] = Avar[\hat\theta_T(Z^o)] + Avar[\hat\theta_T(Z) - \hat\theta_T(Z^o)] + C + C' \tag{7.13}$$


8 Note that $V(Z^o)$ is invariant to $K$.

Equation (7.13) can be rearranged to show that

$$D(Z) = Avar[\hat\theta_T(Z) - \hat\theta_T(Z^o)] + C + C' \tag{7.14}$$

Since $Avar[\hat\theta_T(Z) - \hat\theta_T(Z^o)]$ is positive semi-definite by construction, it is
sufficient to establish that $C = 0$ in order for $D(Z)$ to be positive semi-definite. So
we now focus on the matrix $C$.

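From the definition of a covariance, it follows that

$$C = Acov[\hat\theta_T(Z^o), \hat\theta_T(Z)] - Avar[\hat\theta_T(Z^o)] = Acov[\hat\theta_T(Z^o), \hat\theta_T(Z)] - V(Z^o)$$

and so

$$C = 0 \iff Acov[\hat\theta_T(Z), \hat\theta_T(Z^o)] = V(Z^o)$$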


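It is at this stage that the static nature of the model is exploited because
Assumption 7.2 implies $Acov[\hat\theta_T(Z), \hat\theta_T(Z^o)] = E[m_t(Z)m_t(Z^o)']$. Using an
iterated conditional expectations argument, it follows that

$$\begin{aligned}
E[m_t(Z)m_t(Z^o)'] &= [F_Z(\theta_0)'F_Z(\theta_0)]^{-1}F_Z(\theta_0)'S_Z^{-1/2}\,E\big[Z_t\,E[u_t(\theta_0)u_t(\theta_0)' \mid v_{2,t}]\,Z_t^{o\prime}\big]\,\{E[(\partial u_t(\theta_0)/\partial\theta')'Z_t^{o\prime}]\}^{-1} \\
&= [F_Z(\theta_0)'F_Z(\theta_0)]^{-1}F_Z(\theta_0)'S_Z^{-1/2}\,E\big[Z_t\,\Sigma_{u|v_2,t}\,Z_t^{o\prime}\big]\,\{E[(\partial u_t(\theta_0)/\partial\theta')'Z_t^{o\prime}]\}^{-1}
\end{aligned} \tag{7.15}$$

Using the definition of $Z_t^o$, it follows that

$$E[Z_t\,\Sigma_{u|v_2,t}\,Z_t^{o\prime}] = E\big[Z_t\,E[\partial u_t(\theta_0)/\partial\theta' \mid v_{2,t}]\big] \tag{7.16}$$

Now,

$$\partial u_t(\theta_0)/\partial\theta' = E[\partial u_t(\theta_0)/\partial\theta' \mid v_{2,t}] + A_t \tag{7.17}$$

where $E[A_t \mid \Omega_t] = 0$. Since $Z_t \in \Omega_t$, it follows from (7.16)-(7.17) that

$$E[Z_t\,\Sigma_{u|v_2,t}\,Z_t^{o\prime}] = E[Z_t\,\partial u_t(\theta_0)/\partial\theta'] = S_Z^{1/2}F_Z(\theta_0) \tag{7.18}$$

Substituting (7.18) into (7.15), we obtain

$$E[m_t(Z)m_t(Z^o)'] = [F_Z(\theta_0)'F_Z(\theta_0)]^{-1}F_Z(\theta_0)'F_Z(\theta_0)\,\{E[(\partial u_t(\theta_0)/\partial\theta')'Z_t^{o\prime}]\}^{-1} = V(Z^o)$$

where the last identity follows from (7.17) by similar logic to (7.18). Therefore
$C = 0$ and so $D(Z)$ is positive semi-definite, which establishes the desired result.
□

One aspect of the proof is worth commenting on. Notice that $C = 0$ implies
that $T^{1/2}[\hat\theta_T(Z^o) - \theta_0]$ is asymptotically uncorrelated with $T^{1/2}[\hat\theta_T(Z) - \hat\theta_T(Z^o)]$
for any other choice of instrument $Z$.⁹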

At first sight, it is not obvious why $Z_t^o$ is the optimal instrument. To help
develop an intuition for this result, we consider three simple examples involving
linear models. The first example shows that Theorem 7.1 leads to an IV estimator
which corresponds with the estimation approach proposed in the literature
on linear simultaneous equations models. The second two examples illuminate
the role of $\Sigma_{u|v_2,t}$ in the construction of $Z_t^o$. After these examples, it is shown
how the intuition from linear models can also be used to understand the form
of the optimal instrument in nonlinear models.


Example: Linear Model with s = 1 and Conditional Homoscedasticity
In Chapter 2, we consider the case in which the population moment condition
takes the form,

$$E[z_t u_t(\theta_0)] = 0 \tag{7.19}$$

where $u_t(\theta_0) = y_t - x_t'\theta_0$. Notice that if we set $z_t = Z_{t-m}$ then (7.19) is a special
case of (7.5). For our purposes here, it is sufficient to restrict attention to the
case in which $p = 1$ and so $x_t$ is a scalar. We also now add the restriction that
$u_t(\theta_0)$ is conditionally homoscedastic, and denote this variance by $\Sigma_{u|v_2,t} = \sigma_0^2$.
Since $\partial u_t(\theta_0)/\partial\theta = -x_t$, the optimal instrument is given by

$$Z_t^o = -\frac{k}{\sigma_0^2}\,E[x_t \mid v_{2,t}]$$

for any non-zero finite constant $k$. However, since $k$ can take any such value,
we are free to set $k = -\sigma_0^2$ in which case the optimal instrument reduces to
$Z_t^o = E[x_t \mid v_{2,t}]$. In other words, the optimal instrument is just the part of $x_t$
which can be explained by $v_{2,t}$. It can be verified that the resulting IV (GIV)
estimation with $z_t = Z_t^o$ is identical to OLS estimation of $\theta_0$ based on¹⁰

$$y_t = E[x_t \mid v_{2,t}]\,\theta_0 + \tilde{u}_t$$

where $\tilde{u}_t = u_t(\theta_0) + e_t\theta_0$ collects the original error and the part of $x_t$ not
explained by $v_{2,t}$. The latter is described by Theil (1971)[p.452] as "an obvious estimation
procedure" in his discussion of estimation in linear simultaneous equation models.
◊


9 West (2001) uses this property to characterize the optimal instrument, and considers
conditions under which this holds in dynamic models.

10 Recall that by definition of the conditional expectation, $x_t = E[x_t \mid v_{2,t}] + e_t$ where
$E[e_t \mid v_{2,t}] = 0$.


Example: Linear Model with s = 1 and Conditional Heteroscedasticity
Suppose now that we modify the previous example by introducing conditional
heteroscedasticity in $u_t(\theta_0)$, that is $E[u_t(\theta_0)^2 \mid v_{2,t}] = \sigma_t^2$, but leave all other
aspects of the specification the same. In this case, the optimal instrument takes
the form

$$Z_t^o = -\frac{k}{\sigma_t^2}\,E[x_t \mid v_{2,t}]$$

for any non-zero finite constant $k$. This time, it is not possible to eliminate $\sigma_t^2$
by judicious choice of $k$, although we can set $k = -1$ to remove the minus sign.
In this case, $\Sigma_{u|v_2,t}^{-1}$ scales $E[\partial u_t(\theta_0)/\partial\theta \mid v_{2,t}]$ to take account of the conditional
heteroscedasticity in $u_t(\theta_0)$. ◊
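The role of the $1/\sigma_t^2$ scaling can be illustrated with an exact calculation. The sketch below assumes, purely for illustration, that $v_{2,t}$ takes three values with known probabilities, and that $E[x_t \mid v_{2,t}]$ and $E[u_t(\theta_0)^2 \mid v_{2,t}]$ are known on each of them; all numbers are hypothetical. For a scalar instrument $z(v_{2,t})$ the asymptotic variance of the IV estimator is $E[z^2\sigma_t^2]/(E[z\,E[x_t \mid v_{2,t}]])^2$, and the instrument from the example attains the bound in Theorem 7.1.

```python
import numpy as np

# Hypothetical discrete distribution for v_{2,t} (all values are assumptions).
prob   = np.array([0.2, 0.5, 0.3])     # P(v_{2,t} = support point)
g      = np.array([1.0, 2.0, -1.5])    # E[x_t | v_{2,t}]
sigma2 = np.array([0.5, 1.0, 4.0])     # E[u_t(theta_0)^2 | v_{2,t}]

def avar(z):
    """Asymptotic variance E[z^2 sigma^2] / (E[z g])^2 of the IV estimator with instrument z."""
    num = np.sum(prob * z**2 * sigma2)
    den = np.sum(prob * z * g) ** 2
    return num / den

z_opt   = g / sigma2                    # optimal instrument with k = -1
z_naive = g                             # ignores the conditional heteroscedasticity
z_other = np.array([1.0, 1.0, -1.0])    # some other function of v_{2,t}

v_opt, v_naive, v_other = avar(z_opt), avar(z_naive), avar(z_other)
bound = 1.0 / np.sum(prob * g**2 / sigma2)   # V(Z^o) in the scalar case
print(v_opt, v_naive, v_other, bound)
```

In this sketch the unscaled instrument $E[x_t \mid v_{2,t}]$ is valid but has a larger asymptotic variance than the scaled one, which is the sense in which the weighting matters.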


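Example: Linear Model with s = 2 and Conditional Homoscedasticity
Now suppose that

$$u_t(\theta_0) = \begin{bmatrix} y_{1,t} - x_{1,t}'\theta_{0,1} \\ y_{2,t} - x_{2,t}'\theta_{0,2} \end{bmatrix}$$

where $\theta_0 = (\theta_{0,1}', \theta_{0,2}')'$, $\theta_{0,i}$ is $p_i \times 1$, and assume that $u_t(\theta_0)$ is conditionally
homoscedastic with

$$E[u_t(\theta_0)u_t(\theta_0)' \mid v_{2,t}] = \Sigma_0$$

With this specification, the components of $Z_t^o$ are given by

$$E\left[\frac{\partial u_t(\theta_0)}{\partial\theta'} \,\Big|\, v_{2,t}\right] = -\begin{bmatrix} E[x_{1,t}' \mid v_{2,t}] & 0_{p_2}' \\ 0_{p_1}' & E[x_{2,t}' \mid v_{2,t}] \end{bmatrix}$$

$$\Sigma_{u|v_2,t} = \Sigma_0$$

where $0_a$ is the $a \times 1$ null vector. In this case, $\Sigma_{u|v_2,t}^{-1}$ scales $E[\partial u_t(\theta_0)/\partial\theta' \mid v_{2,t}]$
to take account of any difference between the variances of $u_{1,t}(\theta_0)$ and $u_{2,t}(\theta_0)$,
and any covariance between $u_{1,t}(\theta_0)$ and $u_{2,t}(\theta_0)$. ◊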
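The structure of $Z_t^o$ in the two-equation case can be made concrete as follows. This is a minimal sketch that assembles the $(p \times 2)$ instrument at a single value of $v_{2,t}$; the function name, the conditional means and $\Sigma_0$ are hypothetical inputs chosen for illustration.

```python
import numpy as np

def optimal_instrument_two_eq(ex1, ex2, sigma0, K=None):
    """Z_t^o = K E[du_t/dtheta' | v_{2,t}]' Sigma_0^{-1} for the two-equation linear model.

    ex1, ex2: E[x_{1,t} | v_{2,t}] (length p1) and E[x_{2,t} | v_{2,t}] (length p2).
    """
    ex1, ex2 = np.asarray(ex1, float), np.asarray(ex2, float)
    p1, p2 = ex1.size, ex2.size
    # E[du_t/dtheta' | v_{2,t}] is (2 x p) with rows (-E[x_1'|v], 0') and (0', -E[x_2'|v]).
    d = np.zeros((2, p1 + p2))
    d[0, :p1] = -ex1
    d[1, p1:] = -ex2
    K = np.eye(p1 + p2) if K is None else np.asarray(K, float)
    return K @ d.T @ np.linalg.inv(sigma0), d

sigma0 = np.array([[1.0, 0.5], [0.5, 2.0]])          # hypothetical Sigma_0
Z, d = optimal_instrument_two_eq([1.0, 2.0], [3.0], sigma0)
print(Z.shape)   # (p x s) with p = 3 and s = 2
```

With a non-diagonal $\Sigma_0$, every row of $Z_t^o$ loads on both equations, which is how the covariance between the two errors enters the instrument.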


These examples provide an intuition for the structure of $Z_t^o$ in linear models.
To develop a comparable understanding for the nonlinear model, it is necessary
to explain why $Z_t^o$ depends on $\partial u_t(\theta_0)/\partial\theta'$. The explanation can be found by
comparing the determinants of the asymptotic behaviour of $T^{1/2}(\hat\theta_T - \theta_0)$ in
linear and nonlinear models. To simplify the exposition, we consider the case
in which $s = 1$; we also introduce "L" and "N" subscripts on $\hat\theta_T$ to distinguish
the linear and nonlinear cases. For the linear model in Chapter 2, we have¹¹

$$T^{1/2}(\hat\theta_{L,T} - \theta_0) = \{(T^{-1}X'Z)W_T(T^{-1}Z'X)\}^{-1}(T^{-1}X'Z)W_T(T^{-1/2}Z'u) \tag{7.20}$$

For the nonlinear model, it can be shown that¹²

$$T^{1/2}(\hat\theta_{N,T} - \theta_0) = -\{[T^{-1}D(\theta_0)'Z]W_T[T^{-1}Z'D(\theta_0)]\}^{-1}[T^{-1}D(\theta_0)'Z]W_T\,T^{-1/2}Z'u(\theta_0) + o_p(1) \tag{7.21}$$

where $D(\theta_0)$ is the $T \times p$ matrix with $t$-th row $\partial u_t(\theta_0)/\partial\theta'$, $Z$ is the $T \times q$
matrix with $t$-th row $Z_t'$, and $u(\theta_0)$ is the $T \times 1$ vector with $t$-th element $u_t(\theta_0)$.
A comparison of (7.20) and (7.21) reveals that the asymptotic behaviour of
$T^{1/2}(\hat\theta_{N,T} - \theta_0)$ is identical to that of $T^{1/2}(\hat\theta_{L,T} - \theta_0)$ in a model with regressor
vector $x_t' = -\partial u_t(\theta_0)/\partial\theta'$ and error $u_t(\theta_0)$. This equivalence can be used to
translate the intuition from linear models to their nonlinear counterparts.¹³

11 See equation (2.23).
12 See equations (7.9)-(7.10).

While Theorem 7.1 characterizes the optimal instrument, it does not by itself
solve the problem of instrument selection. The function $Z_t^o$ depends on
$E[\partial u_t(\theta_0)/\partial\theta' \mid v_{2,t}]$ and, in most cases, $\Sigma_{u|v_2,t}$ as well; neither of these functions
is typically part of the specification of the underlying economic/statistical
model. Therefore, $Z_t^o$ is an infeasible choice of instrument. One natural solution
is to estimate the components of $Z_t^o$ from the data. In some cases, this
approach may be plausible. One such case is the linear model in our first example
above, in which case the feasible optimal IV estimator is just the Two Stage
Least Squares estimator as we now illustrate.


Example: Linear Model with s = 1 and Conditional Homoscedasticity
(continued)
To construct a feasible optimal instrument, it is necessary to specify a functional
form for $E[x_t \mid v_{2,t}]$. Therefore, we assume that $x_t$ is itself generated by a linear
regression model,

$$x_t = v_{2,t}'\gamma_0 + e_t \tag{7.22}$$

where $E[e_t \mid v_{2,t}] = 0$ and $E[e_t^2 \mid v_{2,t}] = \sigma_e^2$. With this specification, the optimal
instrument is

$$Z_t^o = v_{2,t}'\gamma_0$$

where we have set $k = -\sigma_0^2$ as discussed above. To construct a feasible counterpart
to $Z_t^o$, it is necessary to estimate $\gamma_0$. Under the above conditions, it is
natural to estimate $\gamma_0$ via Ordinary Least Squares applied to (7.22). If this is
done then the resulting IV estimator of $\theta_0$ is

$$\hat\theta_T = \frac{x'V_2(V_2'V_2)^{-1}V_2'y}{x'V_2(V_2'V_2)^{-1}V_2'x}$$

in the obvious notation. This estimator can be recognized as the Two Stage
Least Squares (2SLS) estimator of $\theta_0$ which is familiar from the linear simultaneous
equations model literature.¹⁴ Using similar arguments to Section 2.3, it
can be shown that $\hat\theta_T$ has the same asymptotic distribution as the IV estimator
of $\theta_0$ with $z_t = Z_t^o$. Therefore, 2SLS can be interpreted as the feasible
optimal instrumental variable estimator within this model.¹⁵ ◊
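The two-step construction in this example can be sketched in a few lines. The code below is an illustration on simulated data, not part of the original analysis: the data generation process, the sample size and the function name `two_sls_scalar` are all assumptions. It estimates $\gamma_0$ by OLS, uses the fitted values as the instrument, and compares the result with the closed-form ratio above.

```python
import numpy as np

def two_sls_scalar(y, x, V2):
    """2SLS for y_t = x_t * theta_0 + u_t with scalar x_t and instruments V2 (T x r)."""
    # First stage: OLS of x on V2; fitted values are the feasible version of v_{2,t}' gamma_0.
    gamma_hat, *_ = np.linalg.lstsq(V2, x, rcond=None)
    z_hat = V2 @ gamma_hat
    # Second stage: IV estimator using the fitted values as the single instrument.
    return (z_hat @ y) / (z_hat @ x)

# Simulated data (hypothetical design): x_t is endogenous through a common shock.
rng = np.random.default_rng(0)
T = 500
V2 = rng.normal(size=(T, 3))
common = rng.normal(size=T)
x = V2 @ np.array([1.0, -0.5, 0.25]) + common + rng.normal(size=T)
u = common + rng.normal(size=T)
y = 2.0 * x + u                      # true theta_0 = 2 in this simulation

theta_2sls = two_sls_scalar(y, x, V2)

# Closed form x'P y / x'P x with P = V2 (V2'V2)^{-1} V2'.
P = V2 @ np.linalg.solve(V2.T @ V2, V2.T)
theta_closed = (x @ P @ y) / (x @ P @ x)
print(theta_2sls, theta_closed)
```

The two numbers agree up to rounding because the fitted values equal $Px$, so using them as the instrument reproduces the projection formula.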


13 Recall that a similar linearization of the moment condition lay behind the construction 
of the identifying restrictions; see Section 3.4.2. 

14 See Theil (1971) [p.451-454]. 

15 This result also applies if p > 1. 




In this example, the construction of $Z_t^o$ rests crucially on the assumption
that $E[x_t \mid v_{2,t}]$ is linear. This specification may be very natural in some contexts,
such as the linear simultaneous equations model, but may not be so
appropriate in others. A comparable approach in nonlinear models would require
an assumption about the conditional mean of $\partial u_t(\theta_0)/\partial\theta'$. Unfortunately,
this is unlikely to be an aspect of the data generation process specified by the
underlying economic model. One alternative is to use non-parametric methods
to approximate this expectation. However, since our ultimate focus is dynamic
models, we do not explore these methods further here. Instead we refer the
interested reader to Newey (1990) or the survey in Newey (1993).¹⁶


7.2.2 Dynamic Models 


A number of papers consider the extension of Theorem 7.1 to dynamic models. 
As might be imagined, the characterization of the optimal instrument depends 
crucially on specific assumptions about the dynamic structure of certain aspects 
of the data generation process. In many cases of interest, economic theory 
provides little guidance on these aspects, and even in cases where the economic 
model provides this type of information, the construction of a feasible version 
of the optimal instrument is intractable. Consequently, there have been few 
attempts to implement GIV with the optimal instrument in the types of model 
in Table 1.1. In view of this, there seems little value to reproducing here the 
very technical analysis needed to rigorously justify the functional form of the 
optimal instrument in dynamic nonlinear models. Instead, we focus on two 
relatively simple dynamic structures, and present only heuristic arguments. Our 
discussion rests heavily on the framework developed in Hansen (1985), and, to 
a lesser extent, the earlier work by Hansen and Sargent (1982) and Hayashi 
and Sims (1983); the interested reader is referred to these sources for the more
technical details.¹⁷

Throughout this sub-section, the information set, $\Omega_{t-m}$, is assumed to contain
the information in the series up until time $t - m$. However, to present a
more formal definition, it is necessary to place additional structure on $v_t$. In our
notation, $v_t$ represents the vector of random variables which appear in the population
moment condition. In many cases in which GIV is applied to dynamic
models, the instruments are lagged values of variables which appear in $u_t(\theta)$.
For example, in Hansen and Singleton's (1982) consumption based asset pricing
model, $u_t(\theta_0)$ depends on $x_{1,t+1} = c_{t+1}/c_t$ and $x_{2,t+1} = r_{t+1}/p_t$, and in our empirical
implementation, the instrument vector contained the constant and lagged
values of $x_{1,t+1}$ and $x_{2,t+1}$.¹⁸ This approach is so common in practice that we
lose little generality by assuming it is followed here. Accordingly, we partition
$v_t = (v_{1,t}', v_{2,t}')'$ and assume that $v_{2,t}$ contains functions of lagged values of $v_{1,t}$.
In this case, the information set is defined to be: $\Omega_{t-m} = \{v_{1,t-m}, v_{1,t-m-1}, \ldots\}$.


16 Note that Newey (1993) considers this issue in the context of cross section data and so
his information set consists of $v_{2,t}$ alone.

17 Also see Hansen, Heaton, and Ogaki (1988), Bates and White (1990) and West (2001).

18 See Section 3.2.

We consider the form of the optimal instrument under two assumptions
about the dynamic structure of $u_t(\theta_0)$. In the first, $u_t(\theta_0)$ is a martingale
difference with respect to $\Omega_{t-1}$ and so $Z_{t-1}u_t(\theta_0)$ is an uncorrelated sequence.
In the second, $u_t(\theta_0)$ is a VMA($n$) process, and so $Z_{t-n-1}u_t(\theta_0)$ is an $n$-dependent
process. We start with the simplest case.


Assumption 7.3 Martingale Difference with Respect to $\Omega_{t-1}$
$u_t(\theta_0)$ is a martingale difference sequence with respect to $\Omega_{t-1}$.


One consequence of this assumption is that $f(v_t, \theta_0) = Z_{t-1}u_t(\theta_0)$ is a serially
uncorrelated process, and it is this property which is important here. An inspection
of the proof of Theorem 7.1 reveals that the serial independence of $\{v_t\}$ is
only important because it implies $\{f(v_t, \theta_0)\}$ is serially uncorrelated. Therefore,
Theorem 7.1 extends directly to the martingale difference case.


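Theorem 7.2 The Optimal Choice of Instrument in Dynamic Models
(i): $u_t(\theta_0)$ is a Martingale Difference with Respect to $\Omega_{t-1}$

If (i) $v_t$ satisfies Assumptions 3.1 and 3.8; (ii) Assumptions 7.1 (with $m = 1$)
and 7.3 hold then the optimal choice of $Z_{t-1}$ in (7.5) is given by

$$Z_{t-1}^o = K\,E[\partial u_t(\theta_0)/\partial\theta' \mid \Omega_{t-1}]'\,\Sigma_{t-1}^{-1}$$

where $K$ is any $(p \times p)$ nonsingular matrix of finite constants and
$\Sigma_{t-1} = E[u_t(\theta_0)u_t(\theta_0)' \mid \Omega_{t-1}]$. This optimal choice leads to a GMM estimator with
asymptotic covariance matrix

$$V(Z^o) = \left\{ E\left[ E[\partial u_t(\theta_0)/\partial\theta' \mid \Omega_{t-1}]'\,\Sigma_{t-1}^{-1}\,E[\partial u_t(\theta_0)/\partial\theta' \mid \Omega_{t-1}] \right] \right\}^{-1}$$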


Just as before, the optimal instrument is infeasible because it depends on
unknown aspects of the data generation process. In view of the relative simplicity
of the dynamic structure, it might be hoped that it is possible to construct
a feasible counterpart to $Z_{t-1}^o$. However, this hope is misplaced in most cases
of interest. The following example illustrates the problems encountered.


Example: Hansen and Singleton's (1982) Consumption Based Asset
Pricing Model

It can be recalled from Section 1.3.1 that $u_t(\theta_0)$ is a martingale difference sequence
in Hansen and Singleton's (1982) version of the consumption based asset
pricing model. In our earlier discussion of this model, we denoted the key
random variables $c_{t+1}/c_t$ and $r_{t+1}/p_t$ by $x_{1,t+1}$ and $x_{2,t+1}$. We continue this
practice here. This means $x_{t+1} = (x_{1,t+1}, x_{2,t+1})'$ plays the role of $v_{1,t}$ in our
discussion above, and so, for consistency, we denote the information set by $\Omega_t$
instead of $\Omega_{t-1}$. With these adjustments in notation, Theorem 7.2 implies the
optimal instrument depends on:

$$E[\partial u_t(\theta_0)/\partial\theta \mid \Omega_t] = E\left[ \begin{pmatrix} \delta_0 \log(x_{1,t+1})\,x_{1,t+1}^{\gamma_0 - 1}\,x_{2,t+1} \\ x_{1,t+1}^{\gamma_0 - 1}\,x_{2,t+1} \end{pmatrix} \,\Big|\, \Omega_t \right]$$

$$\Sigma_t = E[(\delta_0\,x_{1,t+1}^{\gamma_0 - 1}\,x_{2,t+1} - 1)^2 \mid \Omega_t]$$

To calculate these two components of $Z_t^o$ directly involves knowledge of certain
aspects of the conditional distribution of $x_{t+1}$ given $\Omega_t$. A review of Section
1.3.1 reveals that these aspects are not specified as part of the underlying economic
model. There are two natural ways forward. First, in the spirit of 2SLS,
models can be assumed for $E[\partial u_t(\theta_0)/\partial\theta' \mid \Omega_t]$ and $\Sigma_t$. Secondly, the conditional
distribution of $x_{t+1}$ can be estimated and then the relevant conditional
expectations can be approximated using a numerical integration technique such
as quadrature.¹⁹ We consider these in turn.
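To give a flavour of the quadrature route, the sketch below approximates the two conditional expectations by Gauss-Hermite quadrature. Everything about the conditional distribution is an assumption made for illustration and is not part of the economic model: $\log x_{1,t+1}$ is taken to be conditionally normal, $x_{2,t+1}$ is treated as known at time $t$ to keep the integral one-dimensional, $u_t(\theta_0)$ is taken to be $\delta_0 x_{1,t+1}^{\gamma_0-1}x_{2,t+1} - 1$, and the parameter values are hypothetical.

```python
import numpy as np

def gauss_hermite_expectation(func, mu, sd, n_nodes=20):
    """Approximate E[func(X)] for X ~ N(mu, sd^2) by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    return np.sum(weights * func(mu + np.sqrt(2.0) * sd * nodes)) / np.sqrt(np.pi)

# Assumed conditional law: log x_{1,t+1} | Omega_t ~ N(mu, sd^2) (hypothetical values).
mu, sd = 0.02, 0.05
gamma0, delta0, x2 = 0.5, 0.97, 1.03     # hypothetical parameters and known return
a = gamma0 - 1.0

# With z = log x_{1,t+1}: x_1^(gamma0 - 1) = exp(a z) and log x_1 = z.
d_gamma = gauss_hermite_expectation(lambda z: delta0 * z * np.exp(a * z) * x2, mu, sd)
d_delta = gauss_hermite_expectation(lambda z: np.exp(a * z) * x2, mu, sd)

# Lognormal closed forms, available only because of the normality assumption.
mgf = np.exp(a * mu + 0.5 * a**2 * sd**2)
d_gamma_exact = delta0 * x2 * (mu + a * sd**2) * mgf
d_delta_exact = x2 * mgf
print(d_gamma, d_delta)
```

In this special case the integrals have closed forms, which makes the sketch easy to check; in applications the conditional distribution would first have to be estimated, which is where the difficulties discussed in the text arise.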

The first approach shares both the strengths and weaknesses of 2SLS. If the
assumed models for $E[\partial u_t(\theta_0)/\partial\theta' \mid \Omega_t]$ and $\Sigma_t$ are correct then their estimated
versions can be used to construct a feasible optimal instrument. If the assumed
specifications are wrong, then clearly the resulting instrument, while feasible,
is not optimal. Unfortunately, as mentioned above, economic theory provides
little, if any, guidance on suitable specifications.

The second approach raises two problems. First, it requires precisely the type
of distributional assumption which the use of GMM was supposed to avoid.²⁰
Secondly, the estimation is likely to be computationally very burdensome. Each
of these problems is sufficient by itself to make the use of a feasible optimal
instrument unattractive. A further disincentive is provided in a simulation
study reported by Tauchen (1986).²¹ In the controlled simulation environment,
the data generation process is known and so the calculations are more straightforward,
although still computationally burdensome. He finds that in many
cases the numerical optimization routine failed to converge when the optimal
instrument was used. Therefore, the attempted use of the feasible optimal instrument
undermined the estimation completely. ◊


From an analytical perspective, the martingale difference case represents the
best possible scenario because $\{f(v_t, \theta_0) = Z_{t-1}u_t(\theta_0)\}$ is a serially uncorrelated
process and so Theorem 7.1 translates directly. Once serial correlation is introduced,
the form of the optimal instrument must change. To illustrate how, we
now consider the following case.


Assumption 7.4 Moving Average Case
$u_t(\theta_0)$ is generated by the following VMA($n$) process

$$u_t(\theta_0) = A(L)e_t = e_t + A_1e_{t-1} + A_2e_{t-2} + \cdots + A_ne_{t-n}$$

where $\{e_t\}$ satisfies $E[e_t \mid \Omega_{t-i}] = 0$ and $Var[e_t \mid \Omega_{t-i}] = I_s$ for all $i > 0$, and the
roots of $\det[A(z)] = 0$ lie outside the unit circle.


Under this assumption $u_t(\theta_0)$ is a homoscedastic, invertible VMA($n$) process.²²
With this specification, $E[u_t(\theta_0) \mid \Omega_{t-k}]$ is zero for $k > n$ but is non-zero
in general for $k \le n$. Therefore, we set $m = n + 1$ in (7.5).


19 See Tauchen (19856, 1986), Tauchen and Hussey (1991), Ghysels and Hall (1993). 

20 See Sections 1.1 and 1.3.1. 

21 See Section 6.3 for further discussion of this study. 

22 The assumptions of invertibility and conditional homoscedasticity can be relaxed; see
Hansen, Heaton, and Ogaki (1988) and Heaton and Ogaki (1991).

Although Theorem 7.2 does not directly apply to this new setting, we can
exploit it here as part of the following three-step strategy for deducing the form
of the optimal instrument.


Step 1: Transform $u_t(\theta_0)$ into a process $\tilde{u}_t(\theta_0)$ so that $Z_{t-n-1}\tilde{u}_t(\theta_0)$ is a serially
uncorrelated process.


Step 2: Use Theorem 7.2 to characterize the optimal instrument in the transformed
model.


Step 3: Reverse the transformation to deduce the form of the optimal instrument
in the untransformed model from the result in Step 2.


To execute this strategy, we must find the appropriate transformation. As we
review possible candidates, there is one implicit consequence of the assumed
specification which plays a particularly important role, and so it is useful to
highlight this feature at the outset. Since $\Omega_t = \{v_{1,t}, v_{1,t-1}, \ldots\}$, $Z_t \in \Omega_t$
implies $Z_t \in \Omega_{t+j}$ for $j > 0$ but it does not imply that $Z_t \in \Omega_{t-j}$ for $j > 0$.
In other words, $Z_t$ is not strictly exogenous.²³ The key consequence of this
structure is that

$$E[Z_{t-n-1}u_i(\theta_0)] = 0 \quad \text{for } i \ge t \tag{7.23}$$

$$E[Z_{t-n-1}u_i(\theta_0)] \ne 0 \quad \text{for } i \le t_0 \text{ and some } t_0 < t \tag{7.24}$$


The obvious first candidate for the transformation is $A(L)^{-1}$ because the
resulting process $A(L)^{-1}u_t(\theta_0)$ equals $e_t$.²⁴ However, a closer inspection reveals
this filter does not meet our requirements here. Setting $A(L)^{-1} = I + \sum_{i=1}^{\infty}\tilde{A}_iL^i$,
it follows that

$$\begin{aligned}
E[Z_{t-n-1}A(L)^{-1}u_t(\theta_0)] &= E[Z_{t-n-1}\{u_t(\theta_0) + \tilde{A}_1u_{t-1}(\theta_0) + \tilde{A}_2u_{t-2}(\theta_0) + \cdots\}] \\
&= E[Z_{t-n-1}u_t(\theta_0)] + E[Z_{t-n-1}\tilde{A}_1u_{t-1}(\theta_0)] + E[Z_{t-n-1}\tilde{A}_2u_{t-2}(\theta_0)] + \cdots
\end{aligned} \tag{7.25}$$

Using (7.23)-(7.24) to evaluate this expectation, it is apparent that

$$E[Z_{t-n-1}A(L)^{-1}u_t(\theta_0)] \ne 0$$

Therefore, GIV estimation based on the assumption that this expectation is
zero would lead to an inconsistent estimator of $\theta_0$.

The problem here clearly stems from the backward nature of the filter
$A(L)^{-1}$.²⁵ Fortunately, this is not the only type of filter which can be used
to remove the autocovariance structure of $u_t(\theta_0)$. Hayashi and Sims (1983)
suggest using the forward filter $A(L^{-1})^{-1} = I + \sum_{i=1}^{\infty}\tilde{A}_iL^{-i}$.²⁶ This filter not
only removes the autocovariance structure but also produces a sequence which
is still orthogonal to $Z_{t-n-1}$.²⁷ To see this, let $\tilde{u}_t(\theta_0) = A(L^{-1})^{-1}u_t(\theta_0)$ and
observe that

$$\begin{aligned}
E[Z_{t-n-1}\tilde{u}_t(\theta_0)] &= E[Z_{t-n-1}\{u_t(\theta_0) + \tilde{A}_1u_{t+1}(\theta_0) + \tilde{A}_2u_{t+2}(\theta_0) + \cdots\}] \\
&= E[Z_{t-n-1}u_t(\theta_0)] + E[Z_{t-n-1}\tilde{A}_1u_{t+1}(\theta_0)] + E[Z_{t-n-1}\tilde{A}_2u_{t+2}(\theta_0)] + \cdots
\end{aligned} \tag{7.26}$$

Using (7.23), it is straightforward to deduce from (7.26) that $E[Z_{t-n-1}\tilde{u}_t(\theta_0)] = 0$.
In view of this, it is the forward filter which is used here to remove the
autocovariance structure.

23 See Engle, Hendry, and Richard (1983) for a discussion of various types of exogeneity.

24 The filter $A(L)^{-1}$ is actually an infinite order polynomial in $L$ and so would have to be
approximated by a finite order polynomial in practice but this can be ignored here.

25 The filter is said to operate "backwards in time" because the filtered value of $u_t(\theta_0)$ only
depends on its current and past values.

The next stage in the analysis involves the characterization of the optimal
instrument in the transformed model. The transformation ensures that
$Z_{t-n-1}\tilde{u}_t(\theta_0)$ is a serially uncorrelated process and so we can appeal to Theorem 7.2
to deduce that the optimal choice of $Z_{t-n-1}$ in the transformed model
is given by

$$\tilde{Z}_{t-n-1}^o = E[\partial\tilde{u}_t(\theta_0)/\partial\theta' \mid \Omega_{t-n-1}]'\,\{E[\tilde{u}_t(\theta_0)\tilde{u}_t(\theta_0)' \mid \Omega_{t-n-1}]\}^{-1} \tag{7.27}$$

It only remains to reverse the transformation in order to deduce the form
of the optimal instrument in the original model. At first glance, this objective
would appear to be met by premultiplying $\tilde{Z}_{t-n-1}^o$ by $A(L^{-1})$. However, this is
not so because $A(L^{-1})\tilde{Z}_{t-n-1}^o \notin \Omega_{t-n-1}$ due to the forward nature of the filter.
Instead, Hansen (1985) shows that the appropriate transformation is given by
$A(L)^{-1}$, and so the optimal instrument can be calculated via the recursion

$$Z_t^o = -A_1Z_{t-1}^o - A_2Z_{t-2}^o - \cdots - A_nZ_{t-n}^o + \tilde{Z}_t^o \tag{7.28}$$

To construct this optimal instrument in practice, it would be necessary to truncate
the infinite order filter $A(L)^{-1}$. Therefore, Hansen (1985) suggests using
(7.28) with $Z_i^o = 0$ for $i = 0, -1, \ldots, -n$.
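The recursion with zero start-up values is simple to code in the scalar case. The following is a minimal sketch, assuming $s = p = 1$ so that the moving average coefficients are scalars, and writing the recursion as $A(L)Z_t^o = \tilde{Z}_t^o$ solved forward in time; the function name and the inputs are hypothetical.

```python
import numpy as np

def unfilter_instrument(z_tilde, ma_coefs):
    """Apply A(L)^{-1} to a scalar series via
    Z_t = z_tilde_t - a_1 Z_{t-1} - ... - a_n Z_{t-n}, with pre-sample Z set to zero."""
    z_tilde = np.asarray(z_tilde, dtype=float)
    n = len(ma_coefs)
    z = np.zeros_like(z_tilde)
    for t in range(len(z_tilde)):
        acc = z_tilde[t]
        for j in range(1, n + 1):
            if t - j >= 0:                  # pre-sample values are treated as zero
                acc -= ma_coefs[j - 1] * z[t - j]
        z[t] = acc
    return z

lam = 0.5                                    # hypothetical MA(1) coefficient
impulse = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
z = unfilter_instrument(impulse, [lam])
print(z)   # impulse response of (1 + lam L)^{-1}: powers of -lam
```

Feeding in a unit impulse recovers the coefficients of the truncated expansion of $A(L)^{-1}$, which is a convenient check on the sign convention.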

For completeness, we summarize the previous discussion in the following
lemma.²⁸


Lemma 7.2 The Optimal Choice of Instrument in Dynamic Models
(ii): Moving Average Case

If (i) $v_t$ satisfies Assumptions 3.1 and 3.8; (ii) Assumptions 7.1 (with $m = n+1$)
and 7.4 hold then the optimal choice of $Z_{t-n-1}$ in (7.5) is given by

$$Z_{t-n-1}^o = K\,A(L)^{-1}\tilde{Z}_{t-n-1}^o$$

where $K$ is any $(p \times p)$ nonsingular matrix of finite constants,

$$\tilde{Z}_{t-n-1}^o = E[\partial\tilde{u}_t(\theta_0)/\partial\theta' \mid \Omega_{t-n-1}]'\,\{E[\tilde{u}_t(\theta_0)\tilde{u}_t(\theta_0)' \mid \Omega_{t-n-1}]\}^{-1}$$

and $\tilde{u}_t(\theta_0) = A(L^{-1})^{-1}u_t(\theta_0)$.

26 Again we will ignore the infinite nature of the filter for the time being and concentrate
on showing that the technique solves our problem. This filter is said to act "forwards in time"
because the filtered value of $u_t(\theta_0)$ is a function of its current and future values.

27 See Hayashi and Sims (1983) for further discussion of the properties of forward filters.

28 We omit the characterization of the associated asymptotic variance because the resulting
expression is very complicated, and provides no additional insights; see Hansen (1985) for
further details.


To date, this result has had little, if any, impact on the empirical literature
because of the complexity of the calculations involved. If $n$ is known, then the
estimation of $A(L)$ is conceptually straightforward but nevertheless computationally
burdensome.²⁹ If $n$ is unknown then this burden is increased by the need
to estimate the order of the VMA. The estimation of $E[\partial\tilde{u}_t(\theta_0)/\partial\theta' \mid \Omega_{t-n-1}]$ is
problematic for all the reasons described in the martingale difference case above.

While the estimation of $Z_{t-n-1}^o$ is fraught with problems, there are grounds
for anticipating that, under some circumstances, an indirect approach may yield
an IV estimator which achieves the efficiency bound implied by Lemma 7.2. In
order to be able to elaborate on this statement, it is useful to consider first the
form of the optimal instrument in a simple example.³⁰


Example: Univariate Linear Regression Model with MA(1) Errors
Suppose that $u_t(\theta_0) = y_t - x_t\theta_0$ and $p = 1$ so that $x_t$ is a scalar. Let $u_t(\theta_0)$
satisfy Assumption 7.4 with $n = 1$. In this case, there is only one moving average
parameter which is denoted by $\lambda$ here for simplicity, and so $A(L) = 1 + \lambda L$. The
forward filter is then

$$A(L^{-1})^{-1} = 1 - \lambda L^{-1} + \lambda^2L^{-2} - \lambda^3L^{-3} + \cdots \tag{7.29}$$

Using (7.29) and $\partial u_t(\theta_0)/\partial\theta = -x_t$, it follows that

$$E[\partial\tilde{u}_t(\theta_0)/\partial\theta \mid \Omega_{t-2}] = -E[x_t - \lambda x_{t+1} + \lambda^2x_{t+2} - \cdots \mid \Omega_{t-2}] \tag{7.30}$$

To proceed further, it is necessary to make an assumption about the data generation
process for $x_t$. So we now suppose that

$$x_t = \pi w_t + e_{x,t}$$

$$w_t = \psi w_{t-1} + e_{w,t}$$

where $w_t \in \Omega_{t-2}$, $|\psi| < 1$, $\{e_{i,t}\}$ is i.i.d. for $i = x, w$, $E[e_{x,t} \mid \Omega_{t-2}] = 0$, and
$E[e_{w,t} \mid w_{t-1}, w_{t-2}, \ldots] = 0$.³¹ Note that this specification implies

$$\begin{aligned}
x_{t+m} &= \pi w_{t+m} + e_{x,t+m} \\
&= \pi\psi^m w_t + \pi\sum_{i=0}^{m-1}\psi^i e_{w,t+m-i} + e_{x,t+m}
\end{aligned}$$

Therefore, it follows that

$$\begin{aligned}
E[\partial\tilde{u}_t(\theta_0)/\partial\theta \mid \Omega_{t-2}] &= -\pi\{w_t - \lambda\psi w_t + (\lambda\psi)^2w_t - \cdots\} \\
&= -\pi w_t/(1 + \lambda\psi)
\end{aligned}$$

and so the optimal instrument is

$$\begin{aligned}
Z_{t-2}^o &= (1 + \lambda L)^{-1}\gamma w_t \\
&= \gamma w_t - \gamma\lambda w_{t-1} + \gamma\lambda^2 w_{t-2} - \cdots
\end{aligned} \tag{7.31}$$

where $\gamma = -\pi/(1 + \lambda\psi)$. ◊

29 Recall that this burden motivated the use of a VAR approximation to a VARMA process
in den Haan and Levin's (1996) covariance matrix estimator; see Section 3.5.2.

30 This example is based on personal correspondence from Ken West, and I am very grateful
for his permission to use it here.

31 Note that for this specification to be logically consistent, $w_t$ cannot be a lagged value of
either $x_t$ or $y_t$. Therefore, we must modify our definition of the information set used above
to include a third variable $w_t$.
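The geometric sum behind the closed form $-\pi w_t/(1+\lambda\psi)$ can be verified numerically. This is a small sketch with hypothetical parameter values satisfying $|\lambda\psi| < 1$; the function name and the truncation point are assumptions made for illustration.

```python
def expected_filtered_derivative(pi_coef, lam, psi, w_t, n_terms=200):
    """Truncated version of -sum_{m>=0} (-lam)^m * pi * psi^m * w_t,
    using E[x_{t+m} | Omega_{t-2}] = pi * psi^m * w_t under the assumed process for x_t."""
    return -sum(((-lam) ** m) * pi_coef * (psi ** m) * w_t for m in range(n_terms))

# Hypothetical values (assumptions for illustration only).
pi_coef, lam, psi, w_t = 1.5, 0.4, 0.8, 2.0

approx = expected_filtered_derivative(pi_coef, lam, psi, w_t)
closed = -pi_coef * w_t / (1.0 + lam * psi)
print(approx, closed)
```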


In this example, it is necessary to estimate $\pi$, $\lambda$ and $\psi$ in order to construct a
feasible version of $Z_{t-2}^o$. However, the structure of (7.31) suggests an alternative
approach may be viable. Since $Z_{t-2}^o$ is a linear function of $\{w_{t-j};\ j = 0, 1, 2, \ldots\}$,
the "optimal" population moment condition $E[Z_{t-2}^o u_t(\theta_0)] = 0$ is implied by
the set of population moment conditions $\{E[w_{t-j}u_t(\theta_0)] = 0;\ j = 0, 1, \ldots\}$.
Therefore, Hayashi and Sims (1983) suggest bypassing $Z_{t-2}^o$ and estimating $\theta_0$
from the population moment condition

$$E[z_t(q_T)u_t(\theta_0)] = 0$$

where $z_t(q_T) = (w_t, w_{t-1}, w_{t-2}, \ldots, w_{t-q_T})'$. Hayashi and Sims (1983) argue
that if the optimal weighting matrix is used and $q_T \to \infty$ with $T$ then the
resulting estimator is as asymptotically efficient as the estimator based on the
optimal instrument.³² In spite of its intuitive appeal, this conclusion should be
treated with some caution. Hayashi and Sims's (1983) analysis is premised on
the assumption that both estimators have an asymptotic normal distribution,
but their analysis does not consider the rate at which $q_T$ must increase in order
for this to be true.³³ Nevertheless, it seems plausible that the result holds
under certain conditions both in the linear regression model case considered by
Hayashi and Sims (1983), and in nonlinear models as well.
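The stacked instrument vector $z_t(q_T)$ is just a matrix of current and lagged values of $w_t$. The sketch below builds it for a given lag length; the function name, the toy series and the choice $q = 2$ are hypothetical, and the rule for letting $q_T$ grow with $T$ is deliberately left open, as in the text.

```python
import numpy as np

def lag_instrument_matrix(w, q):
    """Rows are z_t(q) = (w_t, w_{t-1}, ..., w_{t-q})' for t = q, ..., T-1 (0-based)."""
    w = np.asarray(w, dtype=float)
    T = w.size
    # Column j holds w lagged j periods, aligned so that row i corresponds to t = q + i.
    return np.column_stack([w[q - j : T - j] for j in range(q + 1)])

w = np.arange(6, dtype=float)        # toy series w_0, ..., w_5
Zq = lag_instrument_matrix(w, 2)
print(Zq)
```

The first $q$ observations are dropped because their lags are unavailable, so the matching residuals $u_t(\theta)$ would need to be trimmed in the same way before forming sample moments.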


7.2.3 Efficiency Comparison with Maximum Likelihood 


It is remarked above that GIV estimation is only undertaken in situations in
which Maximum Likelihood is infeasible. In view of this background, intuition
suggests that the resulting GIV estimator is less efficient asymptotically than the
Maximum Likelihood estimator. Although there have been only a few formal
comparisons of the two methods in the literature, the previous statement is
most likely a good guide. Nevertheless, there are a couple of exceptions which
are worth noting. Both involve linear models and Maximum Likelihood under
a normality assumption. The first case is the linear simultaneous equations
model, in which 2SLS is as asymptotically efficient as Limited Information
Maximum Likelihood; see Theil (1971)[p.507]. The second case is univariate
ARMA($m$,$n$) models. Stoica, Söderström, and Friedlander (1985) show that an
IV estimator of the AR parameters is asymptotically as efficient as Maximum
Likelihood.³⁴ Linearity plays an important role in such results, and this type of
equivalence is unlikely to extend to nonlinear models. To date, the only results
available are in the context of nonlinear simultaneous equation models with a
normality assumption on the errors. In this context, Jorgenson and Laffont
(1974) and Amemiya (1977) show that IV is indeed less efficient asymptotically
than Maximum Likelihood.

32 Hayashi and Sims's (1983) analysis is confined to linear regression models but allows for
$p > 1$.
33 See Section 6.1.3.

However, there is a sense in which GIV estimation based on the optimal
instrument is the best we can do given the information available. Chamberlain
(1987) shows that, in static models, the matrix $V(Z^o)$ in Theorem 7.1 represents
a lower bound on the asymptotic covariance matrix of any consistent and
asymptotically normal estimator of $\theta_0$ in which the only substantive information
used in estimation is the population moment condition in (7.5).³⁵


7.3 Moment Selection in Practice 


Once Maximum Likelihood is ruled out, the extant results on optimal moment
selection do not provide a practical solution to the problem of moment selection.
An immediate problem is that results have only been obtained for the class of
moment conditions associated with GIV estimation. However, even in this case,
the practical value is limited for three reasons. First, the results characterize
the efficient member out of the set of moment conditions which satisfy the
orthogonality condition, but provide no guidance on how to identify this reference
set. Secondly, it turns out that the construction of the optimal instrument is
computationally burdensome and requires assumptions about aspects of the
data generation process which are typically not specified as part of the underlying
economic model. Thirdly, the optimal instrument has desirable asymptotic
properties but there is no guarantee that these translate to desirable finite sample
properties. Therefore, in this section, we consider methods of moment selection
which are arguably of more practical relevance. Sections 7.3.1 and 7.3.2 discuss
methods for moment selection based on the orthogonality condition and relevance
condition respectively, and Section 7.3.3 considers a method of moment
selection based on their sequential use. This section also contains an application
of the methods to Hansen and Singleton's (1982) consumption based asset
pricing model. Section 7.3.4 reviews some related methods which have been
proposed for Instrumental Variables estimators.


34 Also see Hansen and Singleton (1991, 1996). 
35 Chamberlain’s (1987) analysis is based on a form of semiparametric Maximum Likelihood 
estimation known as Empirical Likelihood; see Section 10.2. 




7.3.1 Selection Based on the Orthogonality Condition 


To implement a data based model selection based on this criterion, it is nec- 
essary to find a statistic which can indicate whether or not the orthogonality 
condition is satisfied. The obvious candidate is the overidentifying restrictions
test statistic, $J_T$, given in equation (5.2). For, although we did not use this
terminology in Section 5.1, it can be recognized that the orthogonality condition
is in fact the null hypothesis of this test. Andrews (1999) considers a number
of ways in which this statistic can be used as a basis for moment selection, and 
derives their statistical properties. In this section, we concentrate on Andrews’s 
(1999) information criterion based approach because simulation evidence sug- 
gests this method works best. However, the other methods are briefly discussed 
at the end of this sub-section. 

Information criteria have been applied to the problem of model selection
in a wide variety of settings. In the case here, the criterion is the sum of two 
terms: the overidentifying restrictions test and a “bonus” term which reflects 
the number of overidentifying restrictions. This criterion is evaluated for all 
possible choices of moment condition, and then the selected moment condition 
is the one which minimizes the criterion. To express this idea mathematically, 
it is necessary to index the overidentifying restrictions test statistic by $c$, the
selection vector introduced in Section 7.1. Therefore, we define $J_T(c)$ to be
equal to $J_T$ in (5.2) evaluated at $f(\cdot) = f(\cdot\,; c)$. The moment selection criterion
takes the form

$$\mathrm{MSC}(c) = J_T(c) + B(T, |c|) \qquad (7.32)$$


where $B(T, |c|)$ is the aforementioned bonus term. The selected moment condition
is given by $\hat{c}_T$, the choice of $c$ which minimizes the criterion, that is


$$\hat{c}_T = \arg\min_{c \in C} \mathrm{MSC}(c) \qquad (7.33)$$


Although the minimization is defined over C, Andrews (1999) observes that 
it may be more appropriate to consider a reduced set of possibilities in certain 
circumstances. For example, in our consumption based asset pricing example, all 
the moment conditions are derived from the same Euler condition. If one such 
condition is invalid, then the underlying model is wrong and it makes little sense 
to base the estimation upon only those moment conditions that appear valid. 
In this case, an argument can be made for testing the validity of the candidate 
set alone. In other cases, moment conditions may be naturally associated with 
different aspects of the underlying specification, and so it may be desired to assess 
the validity of different groups of moment conditions using MSC(c). For example,
in the stochastic volatility model in Section 1.3.5, different moment conditions are 
associated with different aspects of the assumed distribution of the series. 

In spite of the previous remarks, we focus on the limiting properties of
$\hat{c}_T$ as defined in (7.33) and what they imply about this method of moment
selection.$^{36}$ In order to develop this analysis, we must: (i) make assumptions
about the limiting behaviour of $J_T(c)$; (ii) specify the properties of the bonus
term, $B(T, |c|)$; (iii) impose certain identification conditions. We address these
three in turn below.

36 It is relatively straightforward to modify the analysis to accommodate minimization over
a restricted set of possibilities, and so this is left to the interested reader.

In Section 5.1, it is shown that the overidentifying restrictions test statistic
converges to a $\chi^2_{q-p}$ distribution if the null hypothesis is correct, but diverges to
infinity if the null is invalid. It is important that both these properties hold here.
However, for the specification of the bonus term below, the rate of divergence
is also important. It can be recalled from Theorems 5.2 and 5.3 that the rate of
divergence depends on the way in which the long run variance is estimated.
Following Andrews (1999), it is assumed here that this variance is estimated
using a mean correction discussed in Section 4.3 so that the resulting estimator is
consistent regardless of whether or not the orthogonality condition is satisfied.$^{37}$
Therefore, we impose the following high level assumption on the overidentifying
restrictions test; more primitive conditions are given in Theorems 5.1 and 5.2.


Assumption 7.5 Regularity Conditions for $J_T(c)$
(i) If $E[f(v_t, \theta; c)] = 0$ for a unique $\theta \in \Theta$ then $J_T(c) \xrightarrow{d} \chi^2_{|c|-p}$; (ii) if
$E[f(v_t, \theta; c)] = \mu(\theta) \neq 0$ for all $\theta \in \Theta$ then $T^{-1} J_T(c) \xrightarrow{p} a(c)$ where $a(c)$ is
a finite positive constant dependent on $c$.


The bonus term takes the form

$$B(T, |c|) = -h(|c|)\,\kappa_T \qquad (7.34)$$

Its constituents are assumed to satisfy the following conditions.


Assumption 7.6 Regularity Conditions for the Bonus Term
(i) $h(\cdot)$ is strictly increasing; (ii) $\kappa_T \to \infty$ as $T \to \infty$ and $\kappa_T = o(T)$.


Notice that under these conditions, the bonus term decreases as $|c|$ increases,
and so, since MSC(c) is minimized, rewards selection vectors which include more
elements from the candidate set. To implement the method, it is necessary to
choose specific functions for $h(\cdot)$ and $\kappa_T$. We consider two choices here, both
of which are suggested by earlier work on order selection in autoregressive time
series. The first involves

$$h(|c|) = |c| - p \quad \text{and} \quad \kappa_T = \ln(T) \qquad (7.35)$$

and corresponds to the BIC proposed by Schwarz (1978). The second involves

$$h(|c|) = |c| - p \quad \text{and} \quad \kappa_T = b \ln[\ln(T)] \qquad (7.36)$$


where $b$ is a finite constant greater than 2. Andrews (1999) recommends setting
$b = 2.01$. These choices of $h(\cdot)$ and $\kappa_T$ correspond to those proposed by Hannan
and Quinn (1979), and so the implied criterion is often denoted HQIC. A third
popular choice in the autoregressive time series literature is the AIC proposed
by Akaike (1974), but the analogous choice of $\kappa_T$ does not satisfy Assumption
7.6. Therefore we do not consider it further at this stage, but, in view of its
popularity, return to it once we have established the properties of selection
methods based on bonus terms which do satisfy Assumption 7.6.

37 Hall, Inoue, and Peixe (2003) show that the consistency result in Theorem 7.3 still holds
if the long run variance is estimated using an uncentred HAC with appropriate modification
of Assumption 7.5. However, we assume here that a centred covariance matrix is used because
this is the way in which the method is normally implemented.
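
The bonus terms in (7.34) to (7.36), together with the AIC analogue mentioned above, can be written down in a few lines. The following Python sketch is only an illustration: the function names `bonus_term` and `msc` and the `kind` labels are assumptions introduced here, and the J statistic is taken as given rather than computed.

```python
import math


def bonus_term(T, n_moments, p, kind="bic", b=2.01):
    """Bonus term B(T,|c|) = -h(|c|) * kappa_T with h(|c|) = |c| - p."""
    if kind == "bic":        # kappa_T = ln(T), as in (7.35)
        kappa = math.log(T)
    elif kind == "hqic":     # kappa_T = b * ln(ln(T)) with b > 2, as in (7.36)
        kappa = b * math.log(math.log(T))
    elif kind == "aic":      # kappa_T = 2; does not satisfy Assumption 7.6(ii)
        kappa = 2.0
    else:
        raise ValueError("kind must be 'bic', 'hqic' or 'aic'")
    return -(n_moments - p) * kappa


def msc(j_stat, T, n_moments, p, kind="bic", b=2.01):
    """Moment selection criterion (7.32) for a given value of the J statistic."""
    return j_stat + bonus_term(T, n_moments, p, kind, b)
```

For a fixed value of the J statistic, a larger number of moments lowers the criterion, which is the sense in which the bonus term rewards longer selection vectors.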

As explained in Section 7.1, there are going to be two layers to the necessary
identification conditions. First, there must be a unique $c$ which minimizes the
population analog to (7.33). Secondly, given this choice of $c$, the orthogonality
condition must be satisfied at a unique value of $\theta$; in other words, the parameter
vector must be identified by the selected moment condition. Conditions
for the latter have already been presented in Section 3.1. So our focus here is on
the identification of the selection vector. It is useful to derive the appropriate
condition in steps. To begin, recall from above that different choices of $f(\cdot)$ can
satisfy the orthogonality condition at different values of $\theta$. Therefore, we define
$\mathcal{Z}^0$ to be the set of selection vectors for which $f(\cdot\,; c)$ satisfies the orthogonality
condition for some parameter value, that is


$$\mathcal{Z}^0 = \{c \in C \text{ such that } E[f(v_t, \theta; c)] = 0 \text{ for some } \theta \in \Theta\}$$


From this set, we need to distinguish those selection vectors which include the
most elements of the candidate set, that is


$$\mathcal{MZ}^0 = \{c \in \mathcal{Z}^0 \text{ such that } |c| \geq |c^*| \text{ for all } c^* \in \mathcal{Z}^0\}$$


For the population analog to (7.33) to have a unique minimum, this set must
contain only one vector which we denote by $c_0$ below. Perforce, this condition
implies $|c_0| > p$.$^{38}$ We now impose this condition along with the requisite
condition for parameter identification.


Assumption 7.7 Identification Conditions
(i) $\mathcal{MZ}^0 = \{c_0\}$; (ii) $E[f(v_t, \theta_0; c_0)] = 0$ and $E[f(v_t, \theta; c_0)] \neq 0$ for all
$\theta \in \Theta \setminus \{\theta_0\}$.


With these conditions in place, the limiting behaviour of $\hat{c}_T$ is given by the
following theorem.


Theorem 7.3 Consistency of $\hat{c}_T$
If Assumptions 7.5-7.7 hold then $\hat{c}_T \xrightarrow{p} c_0$.


Before presenting the proof, we note that this theorem combined with Lemma
7.1 implies that$^{39}$


$$T^{1/2}[\hat{\theta}_T(\hat{c}_T) - \theta_0] \xrightarrow{d} N(0, V_\theta(c_0)) \qquad (7.37)$$


Proof of Theorem 7.3:
Notice that the stated result holds if it can be shown that $\mathrm{MSC}(c_0) < \mathrm{MSC}(c)$
for any $c \neq c_0$ with probability one in the limit as $T \to \infty$. To establish the
latter, it suffices to consider just two cases: (i) $c = c_1$ where $E[f(v_t, \theta_1; c_1)] = 0$
but $c_1 \neq c_0$; (ii) $c = c_2$ where $E[f(v_t, \theta; c_2)] \neq 0$ for any $\theta \in \Theta$. Notice that these
two scenarios cover all other possibilities apart from $c = c_0$. We now consider
them in turn.

38 In general, there is a $\theta(c)$ which satisfies $E[f(v_t, \theta(c); c)] = 0$ for any $c$ such that $|c| = p$;
see the preamble to Chapter 4.
39 See the discussion of Lemma 7.1.

To simplify the notation, we define $\Delta_T(c, c_0) = \mathrm{MSC}(c) - \mathrm{MSC}(c_0)$. From
(7.32) it follows that

$$\Delta_T(c_1, c_0) = J_T(c_1) + B(T, |c_1|) - \{J_T(c_0) + B(T, |c_0|)\}$$

Using (7.34) and Assumption 7.5(i), we have

$$\Delta_T(c_1, c_0) = O_p(1) + [h(|c_0|) - h(|c_1|)]\,\kappa_T$$

As remarked above, Assumption 7.7(i) implies that $|c_0| > |c_1|$ and so

$$\Delta_T(c_1, c_0) = O_p(1) + k\,\kappa_T \qquad (7.38)$$

where $k > 0$. The desired result then follows from (7.38) and Assumption 7.6(ii).
Now consider $\Delta_T(c_2, c_0)$. From (7.32), it follows that

$$T^{-1}\Delta_T(c_2, c_0) = T^{-1}\{J_T(c_2) + B(T, |c_2|) - J_T(c_0) - B(T, |c_0|)\}$$

Using Assumptions 7.5 and 7.6, it can be seen that

$$T^{-1}\Delta_T(c_2, c_0) = a(c_2) + o_p(1) \qquad (7.39)$$

Since $a(c_2) > 0$ from Assumption 7.5(ii), the desired result is established. □


Andrews (2000) reports simulation evidence on the finite sample behaviour 
of these methods in the context of a static linear regression model estimated 
by IV; see Chapter 2. Within his design, there are five regressors and the
candidate set consists of eight instruments, i.e. $p = 5$ and $q_{\max} = 8$. Various
parameter settings are used in which either seven or all eight instruments satisfy
the orthogonality condition. The minimization in (7.33) is performed over a
restricted set to make the computations manageable: three cases are considered
involving respectively 8, 12 and 17 possible selection vectors. The evidence
suggests the model selection procedure works well for some parameter settings 
but not for others. The problems appear to stem from failure in the identification 
conditions, and we consider the ramifications of such failure in an example 
below. The evidence suggests that MSC(c) works marginally better with the 
bonus term associated with BIC given in (7.35). Another feature of the design 
is also pertinent: the maximum degree of overidentification is 3. Hall and
Peixe (2003) report simulation results for a similar linear regression model
estimated by IV in which all instruments satisfy the orthogonality condition
and the maximum degree of overidentification is 7. Within their design $p$ equals
one and six of the instruments are redundant
given the other two. Their evidence indicates that MSC(c) tends to select all the
orthogonal instruments with high probability as would be expected. However,
the inclusion of the redundant instruments leads to a deterioration of the finite
sample properties of the estimator relative to the one based on just the two
non-redundant instruments. This finding motivates moment selection based on
the relevance condition which is the topic of the next sub-section.
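
To make the mechanics of (7.32) and (7.33) concrete, the following Python sketch applies the BIC-type criterion to a simulated linear IV model. Everything in it is an assumption made for illustration rather than a reproduction of any of the simulation designs above: one regressor, four candidate instruments of which the fourth is correlated with the error, no serial correlation, and the helper names `j_statistic` and `select_msc`.

```python
import itertools
import numpy as np


def j_statistic(y, x, Z):
    """Two-step GMM J statistic for y_t = x_t*theta + u_t with instruments Z (T x q)."""
    T = len(y)
    X = x.reshape(-1, 1)

    def gmm(W):
        # Linear GMM estimator for a given weighting matrix W.
        A = X.T @ Z @ W @ Z.T
        return np.linalg.solve(A @ X, A @ y)

    theta1 = gmm(np.linalg.inv(Z.T @ Z / T))   # first step: 2SLS
    g = Z * (y - X @ theta1)[:, None]          # moment functions z_t * u_t
    g = g - g.mean(axis=0)                     # mean correction (centred estimator)
    W2 = np.linalg.inv(g.T @ g / T)            # no autocovariance terms (assumed)
    theta2 = gmm(W2)                           # second step
    gbar = Z.T @ (y - X @ theta2) / T
    return float(T * gbar @ W2 @ gbar)


def select_msc(y, x, Z, p=1):
    """Minimize MSC(c) = J_T(c) - (|c| - p) ln(T) over all c with |c| >= p."""
    T, q = Z.shape
    best = None
    for c in itertools.product([0, 1], repeat=q):
        if sum(c) < p:
            continue
        cols = [i for i in range(q) if c[i]]
        crit = j_statistic(y, x, Z[:, cols]) - (sum(c) - p) * np.log(T)
        if best is None or crit < best[0]:
            best = (crit, c)
    return best[1]


# Simulated data: instruments 1-3 are valid, instrument 4 is contaminated by u_t.
rng = np.random.default_rng(0)
T = 2000
w = rng.standard_normal((T, 4))
u = rng.standard_normal(T)
e = 0.5 * u + rng.standard_normal(T)
x = w[:, 0] + w[:, 1] + w[:, 2] + e
y = 1.0 * x + u
Z = w.copy()
Z[:, 3] += 0.8 * u
c_hat = select_msc(y, x, Z)
```

The exhaustive loop over all selection vectors is feasible here only because the candidate set is small; with eight or more candidates a restricted search of the kind described above becomes attractive.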

In view of its familiarity in other contexts, it is worth considering the properties
of MSC(c) with the bonus term associated with Akaike’s (1974) information
criterion (AIC), that is


$$h(|c|) = |c| - p \quad \text{and} \quad \kappa_T = 2 \qquad (7.40)$$


It can be seen that this choice of bonus term does not satisfy Assumption 7.6 because
$\kappa_T$ does not tend to infinity with $T$. A review of the proof of Theorem 7.3
indicates that (7.39) still holds, and so the method selects moment conditions
which satisfy the orthogonality condition with probability one. However, since
$\kappa_T \not\to \infty$ with $T$, (7.38) no longer implies that $\Delta_T(c_1, c_0) > 0$ with probability
one. Instead, the selected vector is random in the limit, a result which parallels
Shibata’s (1976) finding that AIC overfits the order of autoregressive time
series with non-zero probability in the limit. Andrews (2000) finds this method
performs worst in his simulation study.

We conclude our discussion of MSC(c) by considering the consequences of
identification failure. These are best illustrated within the context of a simple 
example. 


Example: Identification Failure in a Linear Model 
Consider the linear model 


$$y_t = x_t\theta_0 + u_t$$


$$x_t = w_{1,t}\gamma_1 + w_{2,t}\gamma_2 + e_t$$


where all variables are scalars. Once again, we define $u_t(\theta) = y_t - x_t\theta$. The
candidate set of instruments is constructed from the $(8 \times 1)$ vector $w_t$ whose
$i$th element is $w_{i,t}$. The stochastic behaviour of the model depends on $n_t =
[u_t, e_t, w_t']'$, and it is assumed that $n_t \sim N(0, \Sigma)$ where $\Sigma$ has $(i,j)$th element $\sigma_{ij}$
and lower triangular elements


$$\sigma_{ij} = \begin{cases} 1 & \text{for } i = j \\ \sigma_{ue} \neq 0 & \text{for } (i,j) = (1,2) \\ 0 & \text{else} \end{cases}$$


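The candidate set, $f_{\max}(v_t, \theta)$, is assumed to consist of the $(8 \times 1)$ vector whose
$i$th element is $z_{i,t}(y_t - x_t\theta)$ and


$$z_{i,t} = \begin{cases} w_{i,t} & \text{for } i = 1, 2, \ldots, 6 \\ w_{i,t} + \delta_i u_t, \ \delta_i \neq 0, & \text{for } i = 7, 8 \end{cases}$$

With this specification, it is immediately apparent that


$$E[z_{i,t}u_t(\theta_0)] \begin{cases} = 0 & \text{for } i = 1, 2, \ldots, 6 \\ \neq 0 & \text{for } i = 7, 8 \end{cases}$$


However, it is also the case that


$$E[z_{i,t}u_t(\theta_1)] \begin{cases} = 0 & \text{for } i = 3, 4, \ldots, 8 \\ \neq 0 & \text{for } i = 1, 2 \end{cases}$$


for a particular $\theta_1 \neq \theta_0$. Therefore Assumption 7.7(i) fails in this case because
$\mathcal{MZ}^0 = \{c_1, c_2\}$ for $c_1 = (1,1,1,1,1,1,0,0)$ and $c_2 = (0,0,1,1,1,1,1,1)$. ◇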
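
A short derivation may help to show where the two solutions come from. It is a sketch under the example's stated assumptions (unit variances, with $\sigma_{ue}$ the only non-zero covariance) and it writes the coefficients on $w_{1,t}$ and $w_{2,t}$ in the equation for $x_t$ as $\gamma_1$ and $\gamma_2$, a labelling assumed here for illustration, with both taken to be non-zero. Since $u_t(\theta) = u_t + x_t(\theta_0 - \theta)$,

$$E[z_{i,t}u_t(\theta)] = \begin{cases} \gamma_i(\theta_0 - \theta) & \text{for } i = 1, 2 \\ 0 & \text{for } i = 3, \ldots, 6 \\ \delta_i[1 + \sigma_{ue}(\theta_0 - \theta)] & \text{for } i = 7, 8 \end{cases}$$

The first six moments are therefore zero at $\theta = \theta_0$, while the last six are zero at $\theta = \theta_0 + \sigma_{ue}^{-1}$. Under these assumptions each of the two parameter values is supported by exactly six of the eight candidate moments, and no single value of $\theta$ sets seven or more of them to zero.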


If identification fails in this way then the consequences are dramatic. $\hat{c}_T$
converges to a random vector whose probability distribution attaches non-zero
probability to both $c_1$ and $c_2$.$^{40}$ Furthermore, this non-degeneracy manifests
itself in the limiting behaviour of the estimator: $\hat{\theta}_T$ converges to a random
variable $\theta(c)$ whose distribution takes the form: $\theta(c) = \theta_0$ with probability $p_c$
and $\theta(c) = \theta_1$ with probability $1 - p_c$. So in these circumstances moment
selection has undermined the consistency of the estimator. One further aspect
of this case is worth noting. The limiting distribution of $\hat{c}_T$ only attaches non-zero
probability to selection vectors containing six non-zero elements. In this
example, it can be verified that there are no instrument vectors containing seven
or eight elements which would satisfy the orthogonality condition for some $\theta$.
This turns out to be a general result. Andrews (1999) shows that $|\hat{c}_T|$ converges
in probability to the largest $|c|$ such that $E[f(v_t, \theta; c)] = 0$ for some $\theta$. Since it
is impossible to know a priori if the identification condition is satisfied, caution
must be exercised in the use of this method of moment selection. One possible
way forward is to use the method to identify $|c|$, and then examine the associated
$J_T(c)$ for all permutations of $c$ with this length. However, to date, no statistical
theory is available to guide this investigation.

This concludes our discussion of MSC here, but we return to it in Section 
7.3.3 where the method is illustrated using our running empirical example. We 
end this sub-section by briefly considering other methods of moment selection 
based on the overidentifying restrictions test. 

In view of its hypothesis testing origins, it would seem natural to develop
a moment selection strategy based on the outcome of repeated applications of the
overidentifying restrictions test. Andrews (1999) considers two such strategies
known as “upward” and “downward” testing. As the names suggest,
the only difference between them is the direction of testing. The upward sequence
involves considering choices of $f(\cdot)$ of dimension $(p + i) \times 1$ in the
sequence $i = 1, 2, \ldots$ until a significant overidentifying restrictions test is encountered.
The downward sequence involves considering choices of $f(\cdot)$ of dimension
$(q_{\max} - i) \times 1$ in the sequence $i = 0, 1, \ldots$ until an insignificant overidentifying
restrictions test is encountered. In some cases, it may be necessary to consider
all possible choices of a given dimension; in others, it may be possible to limit
the number of permutations considered. Whichever sequence is used, this approach
has the potential to uncover which elements of the candidate set satisfy
the orthogonality condition. However, if a fixed significance level is used then
this approach does not satisfy the inference condition. The problem is that,
by construction, a 5% significance level implies the null hypothesis is falsely rejected
with a probability of 0.05. This makes the outcome of either the upward
or downward testing sequences random in repeated samples. One way around
this problem is to employ a significance level, $\alpha_T$, which decays to zero with $T$.
Pötscher (1983) shows that a suitable rate of decay is given by $\ln(\alpha_T) = o(T)$.
Unfortunately, this type of rule does not indicate how to pick $\alpha_T$ in a given
sample of size $T$. Andrews (2000) also reports simulation evidence for these
methods using $\alpha_T = 0.276/\ln(T)$ where the scaling factor is chosen to yield
$\alpha_{250} = 0.05$. He finds the selection procedures work reasonably well, but are
dominated by MSC with the bonus term associated with BIC. Therefore, we do
not consider these methods further here. The interested reader is referred to
Andrews (1999, 2000).$^{41}$

40 Using a special case of this design, Peixe (2000) finds the probabilities to be 0.564 and
0.436 for MSC based on the BIC bonus term in her simulations for sample size T = 500.
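
The decaying significance level and the downward sequence can be sketched as follows. This is a simplified illustration, not the procedure as implemented in the studies cited: the function names are assumptions, the p-values are made-up inputs, and only one candidate is examined at each dimension, whereas in general several choices of a given dimension may need to be considered.

```python
import math


def alpha_T(T, scale=0.276):
    """Significance level alpha_T = scale / ln(T), which decays to zero as T grows."""
    return scale / math.log(T)


def downward_testing(candidates, alpha):
    """Return the first selection vector whose J test is insignificant.

    `candidates` is a list of (c, p_value) pairs ordered from the largest
    to the smallest number of moments (one candidate per step, for brevity).
    """
    for c, p_value in candidates:
        if p_value > alpha:   # fail to reject the orthogonality condition
            return c
    return None               # every candidate was rejected


# Hypothetical p-values for three nested choices of moment condition.
steps = [((1, 1, 1, 1), 0.001), ((1, 1, 1, 0), 0.30), ((1, 1, 0, 0), 0.85)]
selected = downward_testing(steps, alpha_T(250))
```

With the scaling factor 0.276, the significance level at T = 250 is approximately 0.05, in line with the calibration described above.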


7.3.2 Selection Based on the Relevance Condition 


In this section, we describe an information criterion for moment selection based 
upon the relevance condition. When selection is based on the orthogonality 
condition, there is a natural choice of statistic to capture the sample information. 
With the relevance condition, it is not immediately obvious what constitutes 
the pertinent sample statistic. It can be recalled from Section 7.1 that the 
relevance condition is a combination of the efficiency and non-redundancy
conditions. Since both the latter conditions are statements about the asymptotic 
variance of the estimator, the sample analog of this variance is the natural 
basis for the sample information in an information criterion. However, this 
sample information must be a scalar and so it is necessary to find a suitable 
transformation of the variance. Hall, Inoue, Jana, and Shin (2003) show that 
the natural logarithm of the determinant of the variance is a natural candidate 
because it satisfies the following properties. 


Lemma 7.3 Properties of $\ln[|V_\theta(c)|]$
Let $c_i \in C$ for $i = 1, 2$. If $V_\theta(c_1) - V_\theta(c_2)$ is positive semi-definite then
$\ln[|V_\theta(c_1)|] - \ln[|V_\theta(c_2)|] \geq 0$ with the equality only holding if $V_\theta(c_1) = V_\theta(c_2)$.


Accordingly, Hall, Inoue, Jana, and Shin (2003) propose the relevant moment
selection criterion


$$\mathrm{RMSC}(c) = \ln[|\hat{V}_{\theta,T}(c)|] + P(T, |c|) \qquad (7.41)$$


where $\hat{V}_{\theta,T}(c) = [G_T(\hat{\theta}_T(c); c)'\,\hat{S}_T(c)^{-1}\,G_T(\hat{\theta}_T(c); c)]^{-1}$, and we have now indexed
$G_T(\cdot)$ and $\hat{S}_T(\cdot)$ by $c$. Note that the covariance estimator $\hat{S}_T(c)$ must be
consistent for $S(c)$ (using the obvious notation) but may depend on a preliminary
estimator of $\theta_0$. The penalty term is given by $P(T, |c|)$. The selected
vector is the value which minimizes the criterion over $C$, that is


$$\tilde{c}_T = \arg\min_{c \in C} \mathrm{RMSC}(c)$$

41 Also see Hall, Inoue, and Peixe (2003).


To analyse the asymptotic behaviour of $\tilde{c}_T$, it is necessary to make certain
assumptions. As with our analysis of $\hat{c}_T$ in the previous sub-section, three types
of conditions are required: (i) conditions on the sample statistic; (ii) conditions 
on the penalty term; (iii) identification conditions. In terms of the first of 
these, it is far more convenient here to adopt rather high level assumptions to 
streamline the discussion; more primitive conditions can be found in Chapter 3 
or the references therein. With that caveat, we now present and discuss each of 
these three types of regularity condition in turn. 


Assumption 7.8 Regularity Conditions for $\hat{V}_{\theta,T}(c)$
$\hat{V}_{\theta,T}(c) = V_\theta(c) + O_p(\tau_T^{-1})$ where $\tau_T \to \infty$ as $T \to \infty$.


Notice that the statement of this assumption makes explicit reference to the
rate of convergence of the covariance matrix. This rate depends on the rate of
convergence of the constituents of $\hat{V}_{\theta,T}(c)$. Under the assumptions in Chapter
3, it can be shown that $G_T(\hat{\theta}_T(c); c) = G_0 + O_p(T^{-1/2})$. However, the rate of
convergence for $\hat{S}_T(c)$ depends on the form of the covariance matrix. If $\hat{S}_T(c)$
is the sum of a fixed number of autocovariances, such as $\hat{S}_{SU}$, then it can be
shown that $\hat{S}_T(c) = S + O_p(T^{-1/2})$. In this case, $\tau_T = T^{1/2}$. If $\hat{S}_T(c)$ is an HAC
estimator, then $\hat{S}_T(c) = S + O_p((T/b_T)^{-1/2})$. In this case, $\tau_T = (T/b_T)^{1/2}$.$^{42}$
The exact rate is important because it determines the exact form of the penalty
term.


Assumption 7.9 Regularity Conditions for $P(T, |c|)$
For $c, \bar{c} \in C$ such that $|\bar{c}| > |c|$, $\tau_T[P(T, |\bar{c}|) - P(T, |c|)] \to +\infty$ as $T \to \infty$ and
$P(T, |c|) = o(1)$.


This assumption would be met by the choice

$$P(T, |c|) = (|c| - p)\ln(\tau_T)/\tau_T \qquad (7.42)$$


which corresponds to the BIC-type criterion as discussed in the previous sub-section.
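
The criterion in (7.41) with the penalty in (7.42) is straightforward to evaluate once an estimate of the asymptotic variance is available. The Python sketch below assumes that the variance estimate is supplied by the user and takes $\tau_T = T^{1/2}$ by default; the function name `rmsc` and its arguments are introduced here for illustration only.

```python
import numpy as np


def rmsc(V_hat, T, n_moments, p, tau=None):
    """RMSC(c) = ln|V_hat(c)| + (|c| - p) * ln(tau_T) / tau_T.

    V_hat is the (p x p) estimated asymptotic variance for selection vector c.
    tau defaults to T**0.5, the rate when the covariance estimator uses a
    fixed number of autocovariances (an assumption of this sketch).
    """
    tau = np.sqrt(T) if tau is None else tau
    sign, logdet = np.linalg.slogdet(np.asarray(V_hat, dtype=float))
    if sign <= 0:
        raise ValueError("V_hat must have a positive determinant")
    return logdet + (n_moments - p) * np.log(tau) / tau
```

Using `slogdet` rather than taking the logarithm of `det` avoids overflow or underflow when the determinant is very large or very small. The first term falls when an added moment condition reduces the variance, while the second term charges a price for each extra moment, so a redundant moment raises the criterion.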

As with MSC(c), there are two layers to the identification condition: one
involving the selection vector and one involving $\theta_0$. The first of these identification
conditions defines $c_r$ to be the selection vector associated with the relevant
subset. To formalize this definition it is necessary to introduce the following
sets: the set of selection vectors that are asymptotically efficient relative to the
candidate set,

$$C_E = \{c;\ V_\theta(\iota_{q_{\max}}) = V_\theta(c),\ c \in C\}$$


where $\iota_{q_{\max}}$ is a $q_{\max} \times 1$ vector of ones; and also the subset of $C_E$ containing the
selection vectors of minimum length,


$$C_{\min} = \{c;\ c \in C_E,\ |c| \leq |\bar{c}| \text{ for all } \bar{c} \in C_E\}$$

Using this notation, we impose the following identification conditions.

42 See Section 4.4 for further discussion.


Assumption 7.10 Identification Condition
(i) $C_{\min} = \{c_r\}$; (ii) $E[f(v_t, \theta; c)] = 0$ if and only if $\theta = \theta_0$ for any $c \in C$.


Under these conditions, Hall, Inoue, Jana, and Shin (2003) establish the 
following result. 


Lemma 7.4 Consistency of $\tilde{c}_T$
Under Assumptions 7.8-7.10,

$$\tilde{c}_T \xrightarrow{p} c_r \qquad (7.43)$$


The proof exploits Lemma 7.3 and follows similar lines to Theorem 7.3, and so
is omitted for brevity. Note that this lemma combined with Lemma 7.1 implies
that$^{43}$


$$T^{1/2}[\hat{\theta}_T(\tilde{c}_T) - \theta_0] \xrightarrow{d} N(0, V_\theta(c_r)) \qquad (7.44)$$


Hall, Inoue, Jana, and Shin (2003) report simulation evidence for RMSC in
the context of IV estimation of a linear regression model with a single regressor
$x_t$ and $q_{\max} = 8$ so that the maximum degree of overidentification is seven.$^{44}$
Within their design, all the potential instruments satisfy the orthogonality condition
but six are redundant given the other two. The evidence suggests that
the performance of the method is sensitive to both the $R^2$ from the regression
of $x_t$ on the instruments and also the degree of endogeneity of $x_t$. If the
$R^2$ equals 0.5 then the method does a good job of identifying which moment
conditions are informative about the regression parameter, and the behaviour
of $\hat{\theta}_T(\tilde{c}_T)$ is well approximated by conventional asymptotic theory in samples
of size $T = 100$. If the $R^2$ equals 0.1 then RMSC has problems identifying
which moment conditions are informative about the regression parameter, and
the behaviour of $\hat{\theta}_T(\tilde{c}_T)$ is not well approximated by conventional asymptotic
theory in samples of size $T = 100$.$^{45}$ However, by $T = 500$, the method performs
much better and $\hat{\theta}_T(\tilde{c}_T)$ is well approximated by conventional asymptotic
theory except for cases where $x_t$ is highly endogenous.$^{46}$


43 See the discussion of Lemma 7.1. 

44 See Chapter 2. 

45 Also see Section 8.2. 

46 Here “highly endogenous” means the correlation between $x_t$ and the error of the equation
($u_t$ in the notation of Chapter 2) is 0.9.


7.3.3 A Combined Strategy


In practice, the candidate set, $f_{\max}(v_t, \theta_0)$, is most likely to contain some elements
which satisfy the orthogonality condition and some which do not. Of
these orthogonal instruments, only a subset may satisfy the relevance condition. 
Therefore, it is desirable to develop a method which selects moments on the ba- 
sis of both the orthogonality and relevance conditions. Selection based on either 
MSC(c) or RMSC(c) alone cannot meet this objective because each is based on 
only one of the conditions. However, intuition suggests that a combination of 
the two methods should achieve the desired goal. This section explores the 
properties of such a selection strategy. 
So we now assume that the candidate set is made up as follows. 


Assumption 7.11 Candidate Set

$f_{\max}(v_t, \theta) = [f(v_t, \theta; c_0)', f(v_t, \theta; c_x)']'$ where $c_0$ is defined in Assumption 7.7,
and $f(v_t, \theta; c_0) = [f(v_t, \theta; c_r)', f(v_t, \theta; c_i)']'$ where $c_r$ is defined in Assumption
7.10.


For the sake of exposition, we assume that MSC is applied first, and then RMSC.
The sequence does not affect the essence of the theoretical arguments below, but
may potentially have consequences in finite samples in practice. Since RMSC(c)
is to be applied following MSC(c), it is necessary to modify the definition of $\tilde{c}_T$
to reflect the fact that the minimization is over a candidate set delineated by
the first selection criterion.$^{47}$ Accordingly, we define the set


$$\hat{C}_T = \{c \in \mathbb{R}^{|\hat{c}_T|};\ c_j = 0, 1 \text{ for } j = 1, 2, \ldots, |\hat{c}_T|, \text{ and } c = (c_1, c_2, \ldots, c_{|\hat{c}_T|})',\ |c| \geq p\}$$

and redefine $\tilde{c}_T$ as follows,


$$\tilde{c}_T^{\,s} = \arg\min_{c \in \hat{C}_T} \mathrm{RMSC}(c)$$
The following theorem establishes the consistency of this sequential method 
of moment selection. 


Theorem 7.4 Consistency of $\tilde{c}_T^{\,s}$
If Assumptions 7.5 - 7.11 hold then: $\tilde{c}_T^{\,s} \xrightarrow{p} c_r$.


The proof follows directly from a combination of Theorem 7.3 and Lemma 7.4,
and so is left to the reader. It follows directly from Theorem 7.4 and Lemma
7.1 that the asymptotic distribution of $\hat{\theta}_T(\tilde{c}_T^{\,s})$ is given by$^{48}$


$$T^{1/2}[\hat{\theta}_T(\tilde{c}_T^{\,s}) - \theta_0] \xrightarrow{d} N(0, V_\theta(c_r)) \qquad (7.45)$$
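
The two-stage logic can be sketched in Python as follows. The sketch assumes that the two criteria are supplied as functions of a 0/1 selection tuple; the name `sequential_selection` and the toy criteria in the usage lines are hypothetical and serve only to show the order of operations. Sub-vectors of the first-stage selection are represented in the original coordinates, which is equivalent to searching over the reduced set described above.

```python
import itertools


def sequential_selection(q_max, p, msc, rmsc):
    """Apply MSC first, then RMSC over sub-vectors of the MSC selection.

    msc and rmsc map a 0/1 tuple of length q_max to a float.
    """
    all_c = [c for c in itertools.product([0, 1], repeat=q_max) if sum(c) >= p]
    c_hat = min(all_c, key=msc)                       # stage 1: orthogonality
    sub = [c for c in all_c
           if all(cj <= hj for cj, hj in zip(c, c_hat))]
    return min(sub, key=rmsc)                         # stage 2: relevance


# Toy criteria: the third moment is "invalid", the second is "redundant".
toy_msc = lambda c: 100.0 * c[2] - sum(c)
toy_rmsc = lambda c: (0.0 if c[0] else 5.0) + 0.1 * sum(c)
chosen = sequential_selection(3, 1, toy_msc, toy_rmsc)
```

In the toy example the first stage discards the third moment and the second stage then drops the second, leaving only the first.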


47 See Section 7.3.1 for discussion of circumstances in which it is desirable to minimize 
MSC(c) over subsets of C. 

48 This assumes $f(v_t, \theta; c_0)$ satisfies the regularity conditions of Theorem 3.2. Also see the
discussion of Lemma 7.1.




To date, there have been no simulation studies exploring the finite sample behaviour
of this combined method of moment selection, and this is an interesting
area for future research.$^{49}$ We now illustrate both MSC and RMSC using our
running example.


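Example: Hansen and Singleton’s (1982) Consumption Based Asset
Pricing Model
Our previous empirical implementation is based on the population moment
condition

$$E[z_t(\delta_0 x_{1,t+1}^{\gamma_0 - 1} x_{2,t+1} - 1)] = 0$$

where $x_{1,t+1} = c_{t+1}/c_t$, $x_{2,t+1} = r_{t+1}/p_t$ and $z_t = [1, x_{1,t}, x_{2,t}, x_{1,t-1}, x_{2,t-1}]'$.
It was remarked at the outset that this choice of instrument vector is arbitrary,
and we now consider the performance of the model with the population moment
condition

$$E[f(v_t, \theta_0; c)] = E[z_t(c)(\delta_0 x_{1,t+1}^{\gamma_0 - 1} x_{2,t+1} - 1)] = 0$$


for $c = c_i$, $i = 1, 2, \ldots, 5$ where


$$\begin{aligned}
z_t(c_1) &= [1, x_{1,t}, x_{2,t}]' \\
z_t(c_2) &= [1, x_{1,t}, x_{2,t}, x_{1,t-1}, x_{2,t-1}]' \\
z_t(c_3) &= [1, x_{1,t}, x_{2,t}, x_{1,t-1}, x_{2,t-1}, x_{1,t-2}, x_{2,t-2}]' \\
z_t(c_4) &= [1, x_{1,t}, x_{2,t}, x_{1,t-1}, x_{2,t-1}, x_{1,t-2}, x_{2,t-2}, x_{1,t-3}, x_{2,t-3}]' \\
z_t(c_5) &= [1, x_{1,t}, x_{2,t}, x_{1,t-1}, x_{2,t-1}, x_{1,t-2}, x_{2,t-2}, x_{1,t-3}, x_{2,t-3}, x_{1,t-4}, x_{2,t-4}]'
\end{aligned}$$


Notice that $c = c_2$ gives the moment condition used in our earlier empirical
implementation of this model. Table 7.1 reports the values of MSC(c) and
RMSC(c) for these five choices of $c$.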


Table 7.1
MSC(c) and RMSC(c) for certain choices of instrument
vector in the consumption based asset pricing model


              VWR                      EWR
 i     MSC(c_i)   RMSC(c_i)     MSC(c_i)   RMSC(c_i)
 1      -5.546      0.970         2.141      1.682
 2     -16.671      1.235        -6.308      1.929
 3     -28.806      1.186       -17.920      1.848
 4     -37.438      1.542       -26.617      2.193
 5     -41.294      1.778       -37.446      2.428


Notes: MSC($c_i$) is given by (7.32) with $\hat{S}_T = \hat{S}_{SU,\mu}$ from (4.24) and bonus term given by
(7.35); RMSC($c_i$) is given by (7.41) with $\hat{S}_T = \hat{S}_{SU}$ from (3.40) with penalty term given by
(7.42) with $\tau_T = T^{1/2}$.
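
Reading off the selected instrument vector from Table 7.1 amounts to taking the row with the smallest criterion value in each column. The short Python sketch below transcribes the table values and performs that comparison; the dictionary layout and the helper name `argmin_choice` are assumptions made for illustration.

```python
# Values transcribed from Table 7.1; the key i refers to the selection vector c_i.
table_7_1 = {
    "VWR": {
        "MSC": {1: -5.546, 2: -16.671, 3: -28.806, 4: -37.438, 5: -41.294},
        "RMSC": {1: 0.970, 2: 1.235, 3: 1.186, 4: 1.542, 5: 1.778},
    },
    "EWR": {
        "MSC": {1: 2.141, 2: -6.308, 3: -17.920, 4: -26.617, 5: -37.446},
        "RMSC": {1: 1.682, 2: 1.929, 3: 1.848, 4: 2.193, 5: 2.428},
    },
}


def argmin_choice(values):
    """Return the index i of the selection vector with the smallest criterion."""
    return min(values, key=values.get)


selected = {
    (returns, criterion): argmin_choice(values)
    for returns, criteria in table_7_1.items()
    for criterion, values in criteria.items()
}
```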


49 See Section 7.3.4 for discussion of a related issue.

Consider first the results for value weighted returns (VWR). The value of MSC
falls as the number of instruments increases, and so $c_5$ is the preferred choice from
this limited set. The overidentifying restrictions test statistic associated with
this choice, $J_T(c_5)$, takes the value 13.9262 which implies a p-value of 0.1250.
Therefore, this choice of moments appears valid. However, RMSC indicates that
the choice $c_1$ is preferred. It is interesting to contrast the parameter estimates
and their associated confidence intervals for these two choices of instrument. If
$c = c_5$, the choice selected using MSC, then the two parameter estimates are 0.991 and 1.627,
and their respective 95% asymptotic confidence intervals are (0.985, 0.998) and
(−1.496, 4.751). If $c = c_1$, the choice selected using RMSC, then the estimates
are 0.994 and 0.593, and their respective 95% asymptotic confidence intervals
are (0.987, 1.001) and (−3.026, 4.212).$^{50}$ These results provide an illustration of
the sensitivity of inferences to the choice of moment condition. We now consider
what happens if $f_{\max}(v_t, \theta) = z_t(c_5)(\delta x_{1,t+1}^{\gamma - 1} x_{2,t+1} - 1)$ is treated as the
candidate set and the moment selection is performed by minimizing RMSC(c)
over $C$. One immediate problem is the computational burden associated with
allowing for so many possible choices of instrument vector. For purposes of
comparison, we implemented the search two ways: using the two step estimator
and the iterated estimator with a maximum of twenty iterations. Interestingly,
both versions led to the same selected vector, $z_t(\tilde{c}_T) = (x_{1,t}, x_{1,t-1}, x_{2,t-2})'$,
and a minimized value for RMSC of 0.807. This agreement raises the possibility
that little may be lost by limiting the number of iterations, but further
work is needed to explore whether this finding extends to other settings.
The resulting iterated GMM parameter estimates are (0.994, 0.611),
and their respective 95% asymptotic confidence intervals are (0.987, 1.001) and
(−2.683, 3.906).

Now consider equally weighted returns (EWR). The value for MSC exhibits
a similar pattern to the case with VWR. However, this time $J_T(c_5)$ takes the
value 17.7745 which implies a p-value of 0.0379, and so indicates this choice of
moments is invalid. Therefore, we do not consider this case further.


7.3.4 Other Methods of Instrument Selection 


Both MSC and RMSC are valid for the GMM framework. There has also been 
some recent work on the problem of instrument selection within the framework 
of the GIV estimator described in Section 7.2. In this section, we review two 
particular methods: an information criterion for instrument selection based on 
the relevance condition proposed by Hall and Peixe (2003) and a method based 
on minimizing an approximation to the mean square error proposed by Donald 
and Newey (2001). Each method is designed to address the issue of instrument 
selection in classes of problems encountered in practice. However, neither is 
applicable in the general GIV framework. In view of this lack of generality, we 
only provide a heuristic discussion of the methods. Before we describe these 
two methods, it is worth noting that the same basic question was addressed in 


50 These figures are for the iterated estimator with Sr E su. 

the context of IV estimation of linear simultaneous equation models back in the 
1960’s. Fisher (1965) and Mitchell and Fisher (1970) respectively introduce and 
refine the method of “structurally ordered instrumental variables”. However, 
we do not review their work here because it does not extend to the types of 
model in Table 1.1.$^{51}$

Hall and Peixe (2003) consider the problem of instrument selection when the 
moment condition takes the form 


$$E[z_t(c)u_t(\theta_0)] = 0$$


where $u_t(\theta) = u(v_t,\theta)$ is a scalar function, $u_t(\theta_0)$ is a martingale difference
sequence with respect to $\Omega_{t-1}$, $E[u_t(\theta_0)^2|\Omega_{t-1}] = \sigma_0^2$ and $z_t(c) \in \Omega_{t-1}$. They
are concerned with developing a method for selecting the moment condition — 
or instrument — which satisfies the relevance condition. All the members of the 
candidate set are assumed to satisfy the orthogonality condition. Their method 
is motivated by considering the form of the asymptotic variance of the estimator. 
Under their conditions, the asymptotic distribution of the GIV estimator is 


$$T^{1/2}[\hat{\theta}_T(c) - \theta_0] \stackrel{d}{\to} N(0, V_0(c))$$

where$^{52}$

$$V_0(c) = \sigma_0^2 A(c)'\Lambda(c)^{-2}A(c) \tag{7.46}$$

$\Lambda(c) = \mathrm{diag}(\rho_1(c), \ldots, \rho_p(c))$, $\{\rho_i(c); i = 1,2,\ldots,p\}$ are defined to be the canonical
correlations between $\partial u_t(\theta_0)/\partial\theta$ and $z_t(c)$, and $A(c)$ is the $p \times p$ matrix whose
$i^{th}$ row, $a_i(c)'$, contains the weights in the linear combination of $\partial u_t(\theta_0)/\partial\theta$
associated with the $i^{th}$ canonical correlation.$^{53}$ The form of this variance suggests
that the canonical correlations may provide a basis for selection based on the
relevance condition. In fact, Hall and Peixe (2003) establish that $z_t(c_2)$ is
redundant given $z_t(c_1)$ if and only if


$$\rho_i(c_1 + c_2) = \rho_i(c_1), \qquad i = 1,2,\ldots,p$$


Therefore, Hall and Peixe (2003) propose an information criterion for moment 
selection which exploits the information in these canonical correlations. They 
refer to this criterion as the canonical correlation information criterion (CCIC). 
Since $\theta_0$ is unobservable, CCIC is based on the sample canonical correlations
between $\partial u_t(\tilde{\theta}_T)/\partial\theta$ and $z_t(c)$ where $\tilde{\theta}_T$ is some preliminary estimator.
Using $r_{i,T}(c)$ to denote the $i^{th}$ such canonical correlation, CCIC is given by$^{54}$


$$CCIC(c) = T\sum_{i=1}^{p} \ln[1 - r_{i,T}(c)^2] + (|c| - p)\ln(T) \tag{7.47}$$


51 Also see Hall and Peixe (2000) for further discussion. 

52 This type of decomposition for the asymptotic variance was first presented by Sargan 
(1958) in his study of IV estimators in linear models. 

53 See inter alia Anderson (1984) [Chap. 12] for further discussion of canonical correlations. 

54 This version of CCIC uses the BIC version of the penalty term. Hall and Peixe (2003) 
also consider the behaviour of their method with the HQIC and AIC type penalty terms. 
However, simulation evidence suggests the method works best with BIC and so we do not 
consider the other versions here.


The selected instrument vector is given by $z_t(\hat{c}_T^{CCIC})$ where

$$\hat{c}_T^{CCIC} = \arg\min_{c \in \mathcal{C}} CCIC(c) \tag{7.48}$$
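
The criterion in (7.47) and the selection rule in (7.48) can be illustrated with a minimal sketch. It assumes that `X` is the T x p matrix whose rows hold the derivative of the residual function evaluated at a preliminary estimator, that `Z_all` is the T x n matrix of candidate instruments with full column rank, and that both blocks are centred before the canonical correlations are computed. The function names and the exhaustive subset search are illustrative choices, not part of Hall and Peixe's (2003) presentation.

```python
import itertools
import numpy as np


def canonical_correlations(X, Z):
    """Sample canonical correlations between the columns of X and of Z."""
    # Centre both blocks (an assumption of this sketch).
    Xc = X - X.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    # Orthonormal bases for the two column spaces (full column rank assumed).
    Qx, _ = np.linalg.qr(Xc)
    Qz, _ = np.linalg.qr(Zc)
    # The singular values of Qx'Qz are the canonical correlations.
    r = np.linalg.svd(Qx.T @ Qz, compute_uv=False)
    return np.clip(r, 0.0, 1.0)


def ccic(X, Z):
    """CCIC with the BIC-type penalty: T * sum(ln(1 - r_i^2)) + (|c| - p) * ln(T)."""
    T, p = X.shape
    k = Z.shape[1]
    r = canonical_correlations(X, Z)[:p]
    return T * np.sum(np.log(1.0 - r**2)) + (k - p) * np.log(T)


def select_by_ccic(X, Z_all):
    """Minimise CCIC over all column subsets of Z_all with at least p columns."""
    p = X.shape[1]
    n = Z_all.shape[1]
    best = None
    for k in range(p, n + 1):
        for cols in itertools.combinations(range(n), k):
            value = ccic(X, Z_all[:, list(cols)])
            if best is None or value < best[0]:
                best = (value, cols)
    return best  # (minimised CCIC, tuple of selected column indices)
```

Because the canonical correlations cannot fall when instruments are added, the first term never increases with |c|; the penalty term is what discourages the inclusion of instruments that leave the correlations essentially unchanged. The exhaustive search visits every subset and so is only practical for small candidate sets.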


Hall and Peixe (2003) report simulation results for a static linear regression 
model. Overall, their evidence suggests the method is successful in screening 
out redundant instruments, and that selection based on relevance leads to a 
considerable improvement in the quality of the asymptotic approximation to 
the behaviour of the GIV estimator. They also report simulation results for the 
case in which MSC(c) and CCIC(c) are used sequentially. Interestingly, they 
find the ordering can make a substantial difference to the performance of the 
sequential method in finite samples. Within their design, it proves beneficial to 
use CCIC(c) first and then MSC(c). The interested reader is referred to Hall 
and Peixe (2003) for further discussion of this issue. 

Donald and Newey (2001) consider the problem of instrument selection for 
a type of model that is encountered in cross-sectional studies in labour eco- 
nomics. In these studies, the focus of attention is often on the point estimate 
of a particular parameter, and so the finite sample precision of this estimate is 
a more appropriate criterion for instrument selection than the orthogonality or 
relevance conditions. However, since the resulting method is only applicable to 
i.i.d. data, we confine our discussion to a special case of their method in order 
to illustrate the basic approach. The interested reader is referred to Donald and 
Newey (2001) for a more detailed discussion including simulation evidence and 
an empirical example. Within their framework, it is assumed that the researcher 
wishes to estimate the following single equation by instrumental variables 


$$y_t = Y_t'\gamma_0 + x_{1,t}'\beta_0 + u_t \tag{7.49}$$


where $x_{1,t}$ is a vector of exogenous variables but $Y_t$ is a vector of endogenous
variables generated by

$$Y_t = \Delta(x_t) + e_t \tag{7.50}$$

where $x_t = (x_{1,t}', x_{2,t}')'$. The variables $(x_t, u_t, e_t)$ are assumed to be i.i.d. The
errors are assumed to satisfy: $E[u_t|x_t] = 0$, $E[e_t|x_t] = 0$, $Var[(u_t, e_t')'|x_t] = \Omega$
and $Cov[e_t, u_t|x_t] \neq 0$. It is convenient to stack the unknown parameters into
a single vector and so we set $\theta = (\gamma', \beta')'$, and also to introduce the stacked
system

$$N_t = \begin{bmatrix} Y_t \\ x_{1,t} \end{bmatrix} = d(x_t) + w_t \tag{7.51}$$

where $d(x_t) = [\Delta(x_t)', x_{1,t}']'$ and $w_t = [e_t', 0']'$. The candidate set of instruments
is assumed to consist of elements of the form $z_{t,i} = z_i(x_t)$. Notice that by
construction all instruments satisfy the orthogonality condition. Unlike the
other methods described above, this framework allows for the dimension of the
candidate set to increase with T at some rate which is restricted by the theory
as described below. Donald and Newey (2001) propose choosing c to minimize a
Nagar type approximation to the mean square error (MSE) of the estimator.$^{55}$


55 See Section 6.2.2 for further discussion of Nagar approximations. 

To make this approach operational, Donald and Newey (2001) assume that 
the researcher is interested in a linear combination of the parameters rather 
than the parameter vector itself, and they also propose substituting preliminary 
estimates for any unknown nuisance parameters which appear in the formula. 
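In the case of the Two Stage Least Squares estimator, this approach leads to the
following estimated approximate MSE for the linear combination $\lambda'\hat{\theta}_T(c)$:$^{56}$

$$AMSE(c) = \hat{\sigma}_{\lambda,u}^2\frac{|c|^2}{T} + \hat{\sigma}_u^2\left[\hat{R}_\lambda(c) - \hat{\sigma}_\lambda^2\frac{|c|}{T}\right]$$

where $\hat{\sigma}_u^2 = \hat{u}'\hat{u}/T$, $\hat{\sigma}_\lambda^2 = \lambda'\hat{D}^{-1}\hat{W}'\hat{W}\hat{D}^{-1}\lambda/T$, $\hat{\sigma}_{\lambda,u} = \lambda'\hat{D}^{-1}\hat{W}'\hat{u}/T$ and $\hat{D}$
is a preliminary estimator of $T^{-1}\sum_{t=1}^{T} d(x_t)d(x_t)'$, $\hat{W}$ contains the residuals from
a preliminary estimation of (7.51), $\hat{u}$ is a residual vector from a preliminary
estimation of (7.49), and $\hat{R}_\lambda(c)$ is a measure of the goodness of fit for the
estimation of (7.51) using $z_t(c)$. One possible choice for $\hat{R}_\lambda(c)$ is

$$\hat{R}_\lambda(c) = \frac{\lambda'\hat{D}^{-1}\hat{u}(c)'\hat{u}(c)\hat{D}^{-1}\lambda}{T} + \frac{2\hat{\sigma}_\lambda^2|c|}{T}$$

where $\hat{u}(c) = \{I_T - Z(c)[Z(c)'Z(c)]^{-1}Z(c)'\}N$, $Z(c)$ is the $T \times |c|$ matrix with $t^{th}$
row $z_t(c)'$ and $N$ is the $T \times n$ matrix with $t^{th}$ row $N_t'$. The selected instrument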
vector is the one which minimizes the estimated approximate MSE, that is


$$\hat{c}_T^{MSE} = \arg\min_{c \in \mathcal{C}} AMSE(c)$$


Donald and Newey (2001) provide conditions under which AMSE(c) converges
in probability to the true MSE, and these include the requirement that the candidate
set expands at a rate slower than $T^{1/2}$. Under these conditions, $\hat{c}_T^{MSE}$ can
be considered optimal in the sense that it minimizes the MSE asymptotically
with probability one. However, they do not consider the asymptotic distribution
of $\hat{\theta}_T(\hat{c}_T^{MSE})$. Intuition suggests that the inference condition is satisfied
under certain regularity conditions, but the characterization of these regularity
conditions remains a topic for future research.$^{57}$


7.4 Summary 


In this chapter, we have considered the problem of moment selection. The 
desirable properties for the selected moment depend upon the ultimate objective 
of the study in question. For the majority of this chapter, it is assumed that 
this objective is to perform inference about $\theta_0$ based on the asymptotic theory 
derived in Chapters 3 and 5. Given this context, it is argued that it is desirable 
for the selected vector to satisfy: 


56 Donald and Newey (2001) also consider the Limited Information Maximum Likelihood 
estimator and a bias adjusted version of 2SLS. See their article for further discussion of these 
two cases and comparisons between all three estimators. 

57 See Section 6.1.3 for a discussion of the available asymptotic distribution theory if the 
number of moment conditions increases with T. 

• the orthogonality condition, so that the estimation is based on valid information;


• the efficiency condition, so that inference is based on the asymptotically
most precise estimates;


• the non-redundancy condition, so that the selected moment condition
does not contain any redundant elements whose inclusion can cause a deterioration
in the quality of the asymptotic approximation to finite sample
behaviour.


There have been two approaches to this issue in the literature. The first 
approach is to characterize theoretically the optimal choice of moment condition. 
The second approach is to develop data based methods for moment selection. 
We now briefly summarize the results on each. 


• The optimal moment condition: Given that only asymptotic distribution 
theory is available, the optimal choice is the one that satisfies both the 
orthogonality and efficiency conditions. Given this criterion, the optimal 
moment condition is always the score vector because the resulting GMM 
estimator is the MLE. Unfortunately, this choice is infeasible in the types 
of model listed in Table 1.1. Therefore, it is necessary to restrict the 
search for the optimal moment to settings encountered in practice. For 
the class of Generalized Instrumental Variable estimators, it is possible 
to characterize the functional form of the optimal instrument in terms 
of the information set. However, the optimal instrument is infeasible in 
most nonlinear dynamic models because its construction requires knowl- 
edge of aspects of the data generation process which are typically not 
specified as part of the economic model. However, knowledge of the form 
of the optimal instrument facilitates efficiency comparisons between ML 
and GIV. To date, such comparisons indicate that GIV can be as asymp- 
totically efficient as ML in certain linear models with normal errors, but 
this equivalence does not extend to the general nonlinear model. 


• Data based methods for moment selection: In most circumstances, a researcher
must decide which moments to choose without knowledge of the
underlying data generation process. In such circumstances, moment selection
must perforce be based upon the data, and it is therefore important
that the use of the data in this way does not contaminate the limiting distribution
theory. This consideration yields a fourth desirable property for
a moment selection procedure that is termed the inference condition. To
date, this problem has mostly been approached using information criteria.
The moment selection criterion (MSC) is designed to select moments
on the basis of the orthogonality condition, and the relevant moment selection
criterion (RMSC) is designed to select moments on the basis of
a combination of the efficiency and non-redundancy conditions that is
termed the relevance condition. Under certain conditions, these methods
each satisfy the inference condition. MSC and RMSC can be used
individually or sequentially.


The preliminary evidence suggests that the use of MSC and RMSC can help 
a researcher to avoid situations in which asymptotic theory provides a very poor 
approximation to finite sample behaviour. However, it is also clear that their use 
is not a panacea for the finite sample deficiencies of the conventional asymptotic 
theory that are documented in Chapter 6. Therefore, in the following chapter, 
we explore a number of alternative asymptotic approximations to the finite 
sample behaviour of the GMM estimator. 


8 


Alternative Approximations 
to Finite Sample Behaviour 


In Chapter 6, it is seen that the available simulation evidence indicates that 
the asymptotic theory developed in Chapters 3 and 5 may not provide a good 
approximation to the finite sample behaviour of the GMM estimator in certain 
circumstances of interest. The situation can be ameliorated by careful selection 
of moment conditions, and this motivates the methods described in the previous 
chapter. However, while the use of such moment selection procedures may lead 
to an improvement, the overall quality of the asymptotic approximation may 
still leave something to be desired. Therefore, in this chapter, we consider three 
alternative methods for approximating the finite sample behaviour of the GMM 
estimator and its associated statistics. These three are: (i) the bootstrap; (ii) 
an asymptotic theory developed for the case in which the parameter vector is 
weakly identified by the population moment condition; and (iii) an asymptotic 
theory designed to provide a better approximation when the weighting ma- 
trix is based on a heteroscedasticity autocorrelation covariance (HAC) matrix 
estimator. 


In Section 8.1, we discuss the use of the bootstrap which is a resampling 
technique that has — at least in theory — the potential to improve the quality of 
the approximation in any model. This potential has been successfully realized 
in many areas of statistical inference, and so the method is a natural candidate 
for improving the quality of inferences based on GMM estimators. However, 
it turns out that the extension of the bootstrap to this setting is not so sim- 
ple in terms of both implementation and also the verification that it yields an 
improvement. In particular, complications arise in overidentified, nonlinear dy- 
namic models. While considerable progress has been made in circumventing 
these complications, the available analysis does not yet encompass the general 
framework employed in Chapter 3. Section 8.1.1 provides a brief review of the 
ideas behind the bootstrap. Section 8.1.2 describes the steps involved in the 
application of the bootstrap to nonlinear dynamic models. 

In Section 8.2, we describe an alternative asymptotic theory that has been 
developed for the case in which the parameter vector is weakly identified by the 
population moment condition. Equivalently, this scenario can also be termed as 
the case in which the population moment condition is “nearly uninformative” 
about the parameter vector. To date, this problem has mostly been encountered 
in models estimated by Generalized Instrumental Variables (GIV). Therefore, 
we focus our discussion on this case but the qualitative conclusions extend to 
the GMM setting. Section 8.2.1 presents the limiting behaviour of the GIV 
estimator. Section 8.2.2 presents methods for performing inference within this 
scenario. Section 8.2.3 discusses the detection of poor identification. 

In Section 8.3, we return to the problem of bandwidth selection for HAC
matrix estimators of the long run variance. In Section 3.5.3, we review the literature
on bandwidth selection when the aim is to provide a consistent estimator
of the long run variance. It can be recalled that, to date, there is no definitive
rule for making this selection. One way to remove this ambiguity is simply to
set the bandwidth equal to the sample size. While this choice does not satisfy
the conditions for consistency, it does lead to an alternative asymptotic theory
upon which to base inference about the parameters. This alternative theory is
briefly reviewed in Section 8.3.
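
To make the role of the bandwidth concrete, the following is a minimal sketch of a HAC estimator of the long run variance in which the bandwidth is a free argument, so that it can be set equal to the sample size. Two choices here are assumptions made purely for illustration and are not taken from the text: the Bartlett kernel, and the restriction to a scalar series `u` (in a GMM application the series would be the vector of estimated moment functions). The function name is hypothetical.

```python
import numpy as np


def bartlett_hac(u, bandwidth):
    """Bartlett-kernel HAC estimate of the long run variance of a scalar series."""
    u = np.asarray(u, dtype=float)
    T = u.shape[0]
    u = u - u.mean()                 # work with the demeaned series
    lrv = u @ u / T                  # lag-0 sample autocovariance
    for j in range(1, T):
        weight = 1.0 - j / bandwidth  # Bartlett weight for lag j
        if weight <= 0.0:
            break                     # lags at or beyond the bandwidth get zero weight
        gamma_j = u[j:] @ u[:-j] / T  # lag-j sample autocovariance
        lrv += 2.0 * weight * gamma_j
    return lrv


# Usage: a conventional small bandwidth versus bandwidth equal to the sample size.
rng = np.random.default_rng(0)
u = rng.normal(size=200)
lrv_small = bartlett_hac(u, bandwidth=5)
lrv_full = bartlett_hac(u, bandwidth=u.shape[0])
```

With the bandwidth equal to T every sample autocovariance receives positive weight, which is why the usual consistency argument no longer applies and a different limiting theory is needed for the associated test statistics.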

Since all three alternative approximations are relatively new and so not yet 
widely applied, the discussion here is less technical than before and no formal 
proofs are provided. Instead, the focus is placed on the intuition behind the 
three approaches, and on practical matters. 


8.1 The Bootstrap 


8.1.1 Background and Intuition 


Efron (1979) introduced the term “bootstrap” as a generic name for methods of 
statistical inference based on resampling techniques. By their very nature, re- 
sampling techniques can be computationally burdensome but, with advances in 
computer technology, it has become feasible to apply the method in increasingly 
complex settings. These advances have stimulated a considerable literature in 
statistics on the bootstrap where the method has been used both for the esti- 
mation of bias, variance, and distribution functions, and also for the reduction 
of errors made in the use of approximate significance levels of tests or coverage 
probabilities of approximate confidence intervals. However, it is only relatively 
recently that researchers have considered applying the method in the context 
of GMM. Hall and Horowitz (1996) provide the first treatment of the bootstrap 
based on GMM in the context of nonlinear, dynamic models, and our discussion 
rests heavily on their work.$^{1}$ It is beyond the scope of this book to provide a
comprehensive review of the more general literature on the bootstrap in statistics.
Instead, the interested reader is referred to Hall (1994) and the references
therein.


1 An alternative approach is to base the bootstrap upon the Empirical Likelihood. However,
since this method has only been developed for i.i.d. cases, we do not discuss it here
but do return to it as part of the discussion of Empirical Likelihood in Section 10.2.

The idea behind the bootstrap is best understood by considering a simple 
example in which the method is used to reduce the errors made in the use of an 
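approximate $100\alpha\%$ significance level test. Let $\{v_t; t = 1,2,\ldots,T\}$ be a sample
of independent random draws from a common distribution with mean $\mu_0$ and
variance $\sigma_0^2$. In terms of our framework, the parameter vector is $\theta_0 = [\mu_0, \sigma_0^2]'$.
It is natural to estimate $\theta_0$ from the first two moment conditions, and so, using
(1.1)-(1.2), it follows that:

$$\hat{\theta}_T = \begin{bmatrix} \hat{\mu}_T \\ \hat{\sigma}_T^2 \end{bmatrix} = \begin{bmatrix} \bar{v}_T \\ T^{-1}\sum_{t=1}^{T}(v_t - \bar{v}_T)^2 \end{bmatrix}$$

where $\bar{v}_T = T^{-1}\sum_{t=1}^{T} v_t$. Suppose that it is desired to test the hypothesis
$H_0: \mu_0 = 0$ versus $H_1: \mu_0 \neq 0$ based on a sample of size T. The natural test
statistic is the t-ratio,

$$\tau_T = \frac{T^{1/2}\bar{v}_T}{\hat{\sigma}_T} \tag{8.1}$$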
The decision rule of the test involves the comparison of $|\tau_T|$ with a percentile
from some distribution. The key question is: what is the appropriate distribution?
Before we consider how the bootstrap can be used to answer this question,
it is useful for purposes of comparison to review two more familiar choices of
distribution and the properties of the ensuing tests.

If the true distribution of $\tau_T$ is known then it is possible to perform an exact
test. If $F_T[\tau]$ denotes the true cumulative distribution function of $\tau_T$ then the
decision rule for the test is as follows:


Test based on the finite sample distribution: Reject $H_0$ if $|\tau_T| > c_T(\alpha/2)$ (8.2)


where $F_T[c_T(\alpha/2)] = 1 - \alpha/2$. This version is said to be an “exact” $100\alpha\%$
significance level test because the probability of a type I error is $\alpha$.$^{2}$ Clearly,
this exact test can only be performed if the true distribution function is known.
Unfortunately, this is rarely the case. Therefore, inference is most commonly
based upon the limiting distribution, which for $\tau_T$ is the standard normal distribution.
If $\Phi[\tau]$ denotes the cumulative distribution function of the standard
normal distribution then the decision rule based on the limiting distribution
takes the form:


Test based on the limiting distribution: Reject $H_0$ if $|\tau_T| > c_\infty(\alpha/2)$ (8.3)


where $\Phi[c_\infty(\alpha/2)] = 1 - \alpha/2$. With the decision rule in (8.3), the true probability
of a type I error is $2\{1 - F_T[c_\infty(\alpha/2)]\}$, and this is only guaranteed to be $\alpha$ in
the limit. Therefore, it is not an exact $100\alpha\%$ significance level test for finite T,
but is instead referred to as an “approximate” $100\alpha\%$ level test in finite samples.
As with all approximations, there is an error and it is the desire to reduce this
error that motivates the use of the bootstrap.


2 Such an exact test is most often encountered in circumstances when $\{v_t\}$ are random
draws from a normal distribution because then $\sqrt{(T-1)/T}\,\tau_T$ has a Student’s t distribution
with T - 1 degrees of freedom.

The bootstrap version of the test is based on an alternative approximation to 
the distribution of $\tau_T$ that is obtained via resampling from the observed sample.
The decision rule for this version of the test takes the form:


Test based on the bootstrap distribution: Reject $H_0$ if $|\tau_T| > c_T^B(\alpha/2)$ (8.4)


where $c_T^B(\alpha/2)$ is calculated as follows.


• Draw N samples of size T with replacement from the observed sample.
Let the $n^{th}$ such sample be denoted $(v_1^{(n)}, v_2^{(n)}, \ldots, v_T^{(n)})$.


• For each of these N samples, calculate the statistic

$$\tau_T^{(n)} = \frac{T^{1/2}(\bar{v}_T^{(n)} - \bar{v}_T)}{\hat{\sigma}_T^{(n)}}, \qquad n = 1,2,\ldots,N$$

where $\bar{v}_T^{(n)} = T^{-1}\sum_{t=1}^{T} v_t^{(n)}$ and $\hat{\sigma}_T^{(n)} = \sqrt{T^{-1}\sum_{t=1}^{T}(v_t^{(n)} - \bar{v}_T^{(n)})^2}$.


• $c_T^B(\alpha/2)$ is the $100(1-\alpha)^{th}$ percentile of the empirical distribution of
$\{|\tau_T^{(n)}|; n = 1,2,\ldots,N\}$.
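
The three steps above can be sketched in a few lines. This is only an illustration for the i.i.d. mean example: the function names, the default of 999 replications and the use of NumPy's random generator are assumptions of the sketch, and the variance estimator uses the divisor T as in the definition of the t-ratio.

```python
import numpy as np


def t_ratio(v, centre=0.0):
    """T^{1/2} (mean(v) - centre) / sigma_hat, where sigma_hat uses the divisor T."""
    T = v.shape[0]
    return np.sqrt(T) * (v.mean() - centre) / v.std()  # np.std defaults to divisor T


def bootstrap_critical_value(v, alpha=0.05, n_boot=999, rng=None):
    """100(1 - alpha)th percentile of the absolute bootstrap t-ratios."""
    rng = np.random.default_rng() if rng is None else rng
    T = v.shape[0]
    v_bar = v.mean()
    stats = np.empty(n_boot)
    for n in range(n_boot):
        # Step 1: resample T observations with replacement.
        sample = rng.choice(v, size=T, replace=True)
        # Step 2: t-ratio centred on the ORIGINAL sample mean, not on zero.
        stats[n] = abs(t_ratio(sample, centre=v_bar))
    # Step 3: percentile of the empirical distribution of |tau^(n)|.
    return np.quantile(stats, 1.0 - alpha)


def bootstrap_test(v, alpha=0.05, n_boot=999, rng=None):
    """Return (reject H0: mu = 0?, observed t-ratio, bootstrap critical value)."""
    crit = bootstrap_critical_value(v, alpha, n_boot, rng)
    tau = t_ratio(v)
    return abs(tau) > crit, tau, crit
```

Because each bootstrap statistic is centred on the observed sample mean, the critical value is unaffected by adding a constant to every observation; only the observed t-ratio responds to the location of the data.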


Notice that the critical point is calculated using the empirical distribution of
the absolute value of $\tau_T^{(n)}$. This transformation is taken because it is the absolute
value of $\tau_T$ that appears in the decision rule. Notice also that the t-statistic in
the bootstrap, $\tau_T^{(n)}$, is centred about the sample mean, and so is different from
the original t-statistic. This correction is needed to ensure that $E[\tau_T^{(n)}] = 0$, and
hence that the bootstrapped distribution mimics the first moment properties of
$\tau_T$ under $H_0$ regardless of the true value of $\mu_0$. The bootstrap version of the test
is also only an approximate $100\alpha\%$ significance level test but intuition suggests
that the involvement of the data yields a test whose size is closer to $\alpha$ than its
counterpart based on the limiting distribution. This turns out to be the case,
and a formal justification comes from consideration of Edgeworth expansions of
both the true and the bootstrap cumulative distribution functions.$^{3}$

To begin, we consider the Edgeworth expansion of the true cumulative distribution
function. Under certain regularity conditions, it can be shown that


$$P[\tau_T \le c] = \Phi(c) + T^{-1/2}h_{1,T}(c) + T^{-1}h_{2,T}(c) + o(T^{-1}) \tag{8.5}$$


uniformly over c where $h_{1,T}(c)$ and $h_{2,T}(c)$ are respectively even and odd functions
of c for any T. The properties of $\{h_{i,T}(\cdot); i = 1,2\}$ mean that a convenient
cancellation takes place when we consider the probability that $|\tau_T| \le c$ for any
c > 0. Specifically, it follows from (8.5) that:

$$P[|\tau_T| \le c] = P[\tau_T \le c] - P[\tau_T \le -c] \tag{8.6}$$

$$= \Phi(c) - \Phi(-c) + 2T^{-1}h_{2,T}(c) + o(T^{-1}) \tag{8.7}$$
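
The cancellation behind (8.7) can be written out explicitly. The following worked step is a sketch that writes $\tau_T$ for the t-ratio, $\Phi$ for the standard normal distribution function, and $h_{1,T}$, $h_{2,T}$ for the even and odd functions in the expansion (8.5); it applies (8.5) at $c$ and at $-c$ and subtracts.

```latex
\begin{align*}
P[\tau_T \le c] - P[\tau_T \le -c]
  &= \Phi(c) - \Phi(-c)
     + T^{-1/2}\{h_{1,T}(c) - h_{1,T}(-c)\}
     + T^{-1}\{h_{2,T}(c) - h_{2,T}(-c)\} + o(T^{-1}) \\
  &= \Phi(c) - \Phi(-c) + T^{-1/2}\cdot 0 + T^{-1}\cdot 2h_{2,T}(c) + o(T^{-1}),
\end{align*}
```

The second line uses $h_{1,T}(-c) = h_{1,T}(c)$ (even) and $h_{2,T}(-c) = -h_{2,T}(c)$ (odd). The term of order $T^{-1/2}$ therefore drops out of the two-sided probability, which is why the error in the two-sided test is of order $T^{-1}$ rather than $T^{-1/2}$.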


3 Also see Section 6.2.2.

Since $\Phi(-c) = 1 - \Phi(c)$, equation (8.7) can be simplified further to yield:

$$P[|\tau_T| \le c] = -1 + 2\Phi(c) + 2T^{-1}h_{2,T}(c) + o(T^{-1}) \tag{8.8}$$


For our purposes, it is more convenient to focus on the expansion for $P[|\tau_T| > c]$.
Using (8.8), it follows that


$$P[|\tau_T| > c] = 1 - P[|\tau_T| \le c] = 2\{1 - \Phi(c) - T^{-1}h_{2,T}(c)\} + o(T^{-1}) \tag{8.9}$$


Equation (8.9) can be used to provide insights into the nature of the “approximation”
in the approximate $100\alpha\%$ significance level test based on the limiting
distribution of $\tau_T$. Putting $c = c_\infty(\alpha/2)$, it follows from (8.9) that the true
significance level of this version of the test is


$$P[|\tau_T| > c_\infty(\alpha/2)] = \alpha + O(T^{-1}) \tag{8.10}$$


Therefore, the test based on the limiting distribution has an exact significance
level that deviates from $\alpha$ by a term that is $O(T^{-1})$.
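
The order of the error in (8.10) can be checked numerically in one special case. The sketch below makes the additional assumption that the observations are normal draws, so that the square root of (T-1)/T times the t-ratio has a Student's t distribution with T-1 degrees of freedom (the case noted in footnote 2); the exact size of the test based on the normal critical value is then a Student's t tail probability. The helper names and the use of Simpson's rule for the tail probability are illustrative choices that keep the example within the Python standard library.

```python
import math
from statistics import NormalDist


def student_t_pdf(x, df):
    """Density of the Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def student_t_two_sided_tail(x, df, n_steps=2000):
    """P(|t_df| > x) for x >= 0, using Simpson's rule on [0, x] (n_steps even)."""
    h = x / n_steps
    total = student_t_pdf(0.0, df) + student_t_pdf(x, df)
    for i in range(1, n_steps):
        total += (4 if i % 2 else 2) * student_t_pdf(i * h, df)
    central_mass = 2.0 * total * h / 3.0  # P(|t_df| <= x) by symmetry
    return 1.0 - central_mass


def true_size_of_limit_test(T, alpha=0.05):
    """Exact size of 'reject if |tau_T| > c_inf(alpha/2)' under normal data."""
    c_inf = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # |tau_T| > c  is the same event as  |t_{T-1}| > c * sqrt((T-1)/T).
    return student_t_two_sided_tail(c_inf * math.sqrt((T - 1) / T), T - 1)


for T in (20, 50, 200):
    size = true_size_of_limit_test(T)
    # The last column, T * (size - alpha), should be roughly stable across T
    # if the error in the significance level is of order 1/T.
    print(T, round(size, 4), round(T * (size - 0.05), 3))
```

The scaled error in the last printed column varies little with T, which is the behaviour that an error of order $T^{-1}$ implies. The check says nothing about non-normal data, where the exact distribution of the t-ratio is not available in this form.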

A similar expansion can be developed for probabilities based on the bootstrap
distribution. In practice, this distribution is a function of N, the number
of replications, but the theoretical justification derives from considering the limiting
bootstrap distribution obtained as $N \to \infty$. This clearly raises the issue
of how N should be chosen, and this is considered in Section 8.1.2.3. A second
important aspect of this distribution is that it is conditional on the observed
sample and so is itself subject to sampling variation. This dependence is indicated
by inserting a B superscript on $P[\cdot]$. It should also be noted that this
randomness manifests itself in the percentile $c_T^B(\alpha/2)$, and this feature becomes
important at certain points in the argument. With these features in mind,
we can now consider the Edgeworth expansion for $P^B[\cdot]$. Under appropriate
regularity conditions, it can be shown that


$$P^B[\tau_T \le c] = \Phi(c) + T^{-1/2}h^B_{1,T}(c) + T^{-1}h^B_{2,T}(c) + o_p(T^{-1}) \tag{8.11}$$


uniformly over c where $h^B_{1,T}(c)$ and $h^B_{2,T}(c)$ are respectively even and odd functions
of c. Since $h^B_{i,T}(\cdot)$ have the same properties as their counterparts in the
expansion for the true CDF, we can repeat the argument above to deduce


$$P^B[|\tau_T| > c] = 2\{1 - \Phi(c) - T^{-1}h^B_{2,T}(c)\} + o_p(T^{-1}) \tag{8.12}$$


A comparison of (8.9) and (8.12) indicates that the probabilities based on the
true and bootstrapped distributions differ. However, it can be shown that
$T^{-1}h^B_{2,T}(c)$ converges almost surely to $T^{-1}h_{2,T}(c)$ as $T \to \infty$. Using this result,
(8.12) can be re-written as


$$P^B[|\tau_T| > c] = 2\{1 - \Phi(c) - T^{-1}h_{2,T}(c)\} + o_p(T^{-1}) \tag{8.13}$$


and so it can be recognized that the probabilities based on the true and bootstrap
distributions are equal through terms of order $T^{-1}$. All that remains
is to show that this equivalence implies that the use of the bootstrap version
of the test yields more accurate inference than the test based on the limiting
distribution. This is established in two steps. First, it is shown that
$c_T^B(\alpha/2) = c_T(\alpha/2) + o_p(T^{-1})$. Secondly, it is shown that the preceding relationship
between the percentiles implies that $P[|\tau_T| > c_T^B(\alpha/2)] = \alpha + o(T^{-1})$.
The details follow; for this part of the presentation we set $c_T = c_T(\alpha/2)$ and
$c_T^B = c_T^B(\alpha/2)$ to avoid excessive notation.

The first step can be established by considering


$$d_T(c_T^B) = \Phi(c_T^B) + T^{-1}h_{2,T}(c_T^B) \tag{8.14}$$


Using a Mean Value Theorem expansion of $d_T(c_T^B)$ around $c_T$, it follows
that

$$d_T(c_T^B) = d_T(c_T) + D_T(\bar{c}_T)(c_T^B - c_T) \tag{8.15}$$

where $D_T(c) = \partial d_T(c)/\partial c$ and $\bar{c}_T = \lambda_T c_T^B + (1 - \lambda_T)c_T$ for some $\lambda_T \in [0,1]$.
Simple rearrangement yields


$$d_T(c_T^B) - d_T(c_T) = D_T(\bar{c}_T)(c_T^B - c_T) \tag{8.16}$$


It turns out that $|D_T(\bar{c}_T)|$ is finite and bounded away from zero for non-zero
$\alpha$ although we do not present the details here.$^{4}$ Therefore, the desired result
follows if it can be shown that $d_T(c_T^B) - d_T(c_T) = o_p(T^{-1})$. The latter can
be established by manipulating the expansions derived above. Since
$P[|\tau_T| > c_T] = \alpha$ by construction, it follows from (8.9) that


$$1 - \Phi(c_T) - T^{-1}h_{2,T}(c_T) = \alpha/2 + o(T^{-1}) \tag{8.17}$$

Similarly, since $P^B[|\tau_T| > c_T^B] = \alpha$, it follows from (8.13) that

$$1 - \Phi(c_T^B) - T^{-1}h_{2,T}(c_T^B) = \alpha/2 + o_p(T^{-1}) \tag{8.18}$$


Taken together (8.17) and (8.18) imply that $d_T(c_T^B) - d_T(c_T) = o_p(T^{-1})$, and
so it follows from (8.16) that


$$c_T^B = c_T + o_p(T^{-1}) \tag{8.19}$$


This completes the first step of the argument. 

To establish the second step, it is useful to express the probabilities $P[|\tau_T| > c_T]$
and $P[|\tau_T| > c_T^B]$ in terms of indicator functions. To this end, define $\mathcal{I}(A)$
to be an indicator function that takes the value one if event A occurs and is
zero otherwise. Using this notation, we have


$$P[|\tau_T| > c_T] = E[\mathcal{I}(|\tau_T| > c_T)] \tag{8.20}$$

$$P[|\tau_T| > c_T^B] = E[\mathcal{I}(|\tau_T| > c_T^B)] \tag{8.21}$$


It is therefore possible to compare the probabilities by comparing the underlying
indicator functions. It can be recognized that $\mathcal{I}(|\tau_T| > c_T)$ and $\mathcal{I}(|\tau_T| > c_T^B)$
agree if either $|\tau_T| > \max\{c_T^B, c_T\}$ or $\min\{c_T, c_T^B\} \ge |\tau_T|$ because in these cases
we have


$$|\tau_T| > \max\{c_T^B, c_T\} \;\Rightarrow\; \mathcal{I}(|\tau_T| > c_T) = \mathcal{I}(|\tau_T| > c_T^B) = 1$$

$$\min\{c_T, c_T^B\} \ge |\tau_T| \;\Rightarrow\; \mathcal{I}(|\tau_T| > c_T) = \mathcal{I}(|\tau_T| > c_T^B) = 0$$


However they disagree if either $c_T^B \ge |\tau_T| > c_T$ or $c_T \ge |\tau_T| > c_T^B$ because in
these cases we have


$$c_T^B \ge |\tau_T| > c_T \;\Rightarrow\; \mathcal{I}(|\tau_T| > c_T) = 1, \quad \mathcal{I}(|\tau_T| > c_T^B) = 0$$

$$c_T \ge |\tau_T| > c_T^B \;\Rightarrow\; \mathcal{I}(|\tau_T| > c_T) = 0, \quad \mathcal{I}(|\tau_T| > c_T^B) = 1$$

Using these relations, it is clear that

$$\mathcal{I}(|\tau_T| > c_T^B) = \mathcal{I}(|\tau_T| > c_T) - \mathcal{I}(c_T^B \ge |\tau_T| > c_T) + \mathcal{I}(c_T \ge |\tau_T| > c_T^B) \tag{8.22}$$

The substitution of (8.22) into the right hand side of (8.21) yields


$$P[|\tau_T| > c_T^B] = E[\mathcal{I}(|\tau_T| > c_T)] - E[\mathcal{I}(c_T^B \ge |\tau_T| > c_T)] + E[\mathcal{I}(c_T \ge |\tau_T| > c_T^B)] \tag{8.23}$$


Equation (8.20) and the definition of $c_T$ imply that the first term on the right
hand side of (8.23) is $\alpha$, and (8.19) implies that the other terms on the right
hand side are collectively $o(T^{-1})$. Substituting these results into (8.23) yields


$$P[|\tau_T| > c_T^B] = \alpha + o(T^{-1}) \tag{8.24}$$


4 See Hall (1994).


A comparison of (8.10) and (8.24) reveals the potential gains from the use 
of the bootstrap. The use of the bootstrap involves an approximation error in 
the significance level of $o(T^{-1})$, whereas basing inference on the limiting distribution
involves an approximation error of $O(T^{-1})$. The bootstrap is therefore
said to provide “asymptotic refinements”. In more general settings, the bootstrap
yields such asymptotic refinements in cases where inference is based on an
asymptotically pivotal statistic, that is a statistic whose limiting distribution is
independent of the distribution of the data. While this asymptotic refinement
motivates the use of the bootstrap, it should be noted that order statements
only reveal something about the rate at which the error decreases. They do not
tell us anything about the magnitude of the error for a given T, and so there
is no guarantee that the bootstrap yields more reliable inference procedures in
every case.$^{5}$

The above discussion has focused on a very simple case in which the data 
are independently and identically distributed and the parameter vector is just 
identified by the population moment condition. The key question is whether the 
method and the theoretical arguments can be extended to the case where the 
data are dependent, the parameter vector is overidentified and the population 
condition is nonlinear in the parameters. The answer is in the affirmative but 
subject to some important qualifications. This is the topic of the next sub- 
section. 


5 For example, if the order TT! term in (8.9) may be 10~°T~! then the use of the bootstrap 
is unlikely to yield a significant improvement over inference based on the limiting distribution. 


8.1.2 Nonlinear Dynamic Models 


Hall and Horowitz (1996) provide the first rigorous treatment of the bootstrap 
based on GMM estimation of the parameters in nonlinear dynamic models. 
Their analysis deals with the use of the so-called block bootstrap with non-overlapping
blocks to reduce the error in the significance level of the overidentifying
restrictions test statistic and the two-sided t-statistic for testing
$H_0: \theta_i = \theta_{0,i}$. Andrews (2002b) extends Hall and Horowitz’s (1996) analysis
in a number of directions that include the use of the block bootstrap with either
overlapping or non-overlapping blocks and a consideration of a broader array of
inference procedures. Our discussion covers both versions of the block bootstrap
but, for brevity, we restrict attention to just two statistics, the overidentifying
restrictions test and the two-sided confidence intervals for $\theta_{0,i}$. Therefore, this
sub-section is based on a synthesis of certain results in Hall and Horowitz (1996)
and Andrews (2002b) and relies heavily on these sources.
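
As a rough illustration of the difference between the two versions of the block bootstrap, the sketch below generates the time indices of one resample by drawing whole blocks of consecutive observations with replacement. It covers only the resampling step and none of the adjustments to the statistics that the cited papers require. The function name, the zero-based indexing and the convention that a resample consists of T // block_length blocks (so that it is shorter than T when T is not a multiple of the block length) are assumptions of this sketch.

```python
import numpy as np


def block_bootstrap_indices(T, block_length, overlapping=False, rng=None):
    """Time indices for one block-bootstrap resample of a series of length T."""
    rng = np.random.default_rng() if rng is None else rng
    if overlapping:
        # Any admissible starting point: blocks may share observations.
        starts = np.arange(0, T - block_length + 1)
    else:
        # Starting points of the disjoint blocks that partition the sample.
        starts = np.arange(0, T - block_length + 1, block_length)
    n_blocks = T // block_length
    chosen = rng.choice(starts, size=n_blocks, replace=True)
    # Lay the chosen blocks end to end, keeping the order within each block.
    return np.concatenate([np.arange(s, s + block_length) for s in chosen])


# Usage: resample the rows of a (T x r) data matrix v.
rng = np.random.default_rng(0)
v = rng.normal(size=(20, 2))
v_star = v[block_bootstrap_indices(20, 5, overlapping=False, rng=rng)]
```

Drawing blocks rather than single observations preserves the dependence between neighbouring observations within each block, which is the reason for using this scheme with time series data.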

As discussed in the previous sub-section, the theoretical justification for the 
bootstrap derives from Edgeworth expansions. Such expansions are only valid 
under certain regularity conditions, and these conditions turn out to be far more 
restrictive than those used to underpin the asymptotic analysis in Chapters 3 
and 5. Although we do not consider these Edgeworth expansions here, we begin 
this sub-section with a discussion of the necessary regularity conditions in order 
to highlight the key differences from the earlier analysis. After that, we describe 
the mechanics of applying the block bootstrap within the GMM setting. 

The discussion here is premised on the assumptions that the overidentifying
restrictions test has the limiting chi-squared distribution given in Theorem
5.1 and $T^{1/2}(\hat\theta_T - \theta_0)$ has the limiting normal distribution given in Theorem
3.2. In addition to the regularity conditions required for these results, Hall and
Horowitz (1996) and Andrews (2002b) impose a number of other conditions.
For our purposes here, it suffices to focus on the conditions that most obviously
restrict the model in comparison to the framework in Chapters 3 and 5. These
conditions involve the dependence structure of $v_t$, the autocovariance structure
of $f(v_t,\theta_0)$, and the composition of $v_t$. The interested reader is referred to the
aforementioned sources for a complete listing of the required conditions.⁶

It can be recalled that our asymptotic analysis is premised on the assumption 
that the data are stationary and ergodic. As discussed in Appendix A, this 
assumption places restrictions on the memory of the process. To implement the 
bootstrap, it is necessary to restrict this memory further. 


Assumption 8.1 Approximation of $v_t$ by an m-Dependent Process
There is a sequence of $s \times 1$ i.i.d. vectors $\{e_t\}_{t=-\infty}^{\infty}$ with $s \geq r$ and an $r \times 1$
function $h$ such that the $r \times 1$ vector $v_t$ can be written as $v_t = h(e_t, e_{t-1}, e_{t-2}, \ldots)$.
There is a constant $d > 0$ such that for all $t = 1,2,\ldots$ and all $m > d^{-1}$,

$$\| h(e_t, e_{t-1}, e_{t-2}, \ldots) - h(e_t, e_{t-1}, e_{t-2}, \ldots, e_{t-m}, 0, 0, \ldots) \| \leq d^{-1} e^{-dm}$$


6 The remaining conditions involve restrictions on $f(v_t, \theta)$ pertaining to continuity,
existence of certain moments and the existence and smoothness of derivatives up to the
fourth order.

This condition implies that the dynamic behaviour of $v_t$ can be approximated by
a nonlinear moving average of $\{e_{t-i}; i = 0,1,\ldots,m\}$ in the sense described above
and so $\{e_{t-i}; i = m+1, m+2, \ldots\}$ have a negligible effect on the dynamics of $v_t$.
This restriction implies that $v_t$ is $\alpha$-mixing with mixing parameter $\alpha_m = e^{-dm}$,
which is a much faster rate than is required for the Weak Law of Large Numbers
or Central Limit Theorem.⁷ In spite of this limitation, Assumption 8.1 is still
satisfied by a number of empirically relevant models such as infinite order moving
average processes with exponentially decreasing coefficients.


The earlier asymptotic analysis places fairly mild restrictions on the
autocovariance structure of $f(v_t, \theta_0)$, requiring simply that the long run variance, $S$,
exists and is positive definite. For the bootstrap analysis here, it is necessary
to limit this dependence structure as follows.


Assumption 8.2 $f(v_t,\theta_0)$ is a k-Dependent Process
$E[f(v_t, \theta_0) f(v_{t-i}, \theta_0)'] = 0$ for $i > k$ and some $k < \infty$.


This assumption implies that

$$S = \Gamma_0 + \sum_{i=1}^{k} (\Gamma_i + \Gamma_i') \qquad (8.25)$$

where $\Gamma_i = E[f(v_t, \theta_0) f(v_{t-i}, \theta_0)']$. While this assumption is clearly not
universally valid, it is satisfied by a number of models of interest: witness our running
empirical example of the consumption based asset pricing model in which the
underlying economic theory implies that $f(v_t, \theta_0)$ satisfies this restriction with
$k = 0$. Parenthetically, we note that there are grounds for anticipating that this
assumption can be relaxed in future work. Inoue and Shintani (2003) consider
the use of the bootstrap in linear models estimated by instrumental variables
under very weak restrictions on the dependence structure of $f(v_t,\theta_0)$.⁸ However,
to date, these results have not been extended to GMM estimators of nonlinear
models and so we do not consider their framework here.


The last additional restriction highlighted here involves the composition of $v_t$
in terms of continuous and discrete random variables. To date, no assumptions
have been made regarding this aspect of the model. However, such an assumption
is now required.


Assumption 8.3 Composition of $v_t$
$v_t$ can be partitioned into $(v_t^{(c)\prime}, v_t^{(d)\prime})'$ where $v_t^{(c)} \in R^c$ for some $c > 0$ and
$v_t^{(d)} \in R^d$ for $d \geq 0$ and $c + d = r$. The distributions of $v_t^{(c)}$ and $\partial f(v_t, \theta_0)/\partial\theta'$
are absolutely continuous. The distribution of $v_t^{(d)}$ is discrete.


7 See Appendix A for a definition of $\alpha_m$.

8 Inoue and Shintani (2003) provide a theoretical justification for the bootstrap in this 
case but find the potential gains are not as great as those described here. They also find that 
the potential gains are sensitive to the choice of kernel in the HAC estimator. 
In other words, $v_t$ must contain at least one continuous random variable; the
remaining elements may include discrete random variables but this is not
necessary. While noteworthy, this assumption is unlikely to be particularly restrictive
in practice because most economic models involve at least one continuous
random variable.

We now turn to the mechanics of the bootstrap. This discussion breaks 
down naturally into three parts. Section 8.1.2.1 considers appropriate designs 
for the bootstrap sampling scheme when the data are dependent, and this leads 
to a discussion of the block bootstrap. Section 8.1.2.2 describes the appropriate 
construction of the statistics whose bootstrap distributions are used to approxi- 
mate the distribution of our statistics of interest. This sub-section also includes 
a brief discussion of the so-called approximate bootstrap method that has been 
proposed to reduce the computational burden in nonlinear models. Section 
8.1.2.3 presents a rule for picking N, the number of bootstrap replications. As 
emerges below, the precise details require fairly lengthy explanation. Therefore, 
Section 8.1.2.4 summarizes the necessary calculations and also illustrates them 
using our running empirical example. 


8.1.2.1 Generation of Bootstrap Sample When the Data 
are Dependent 


There are two basic approaches to constructing the bootstrap sample when the
data are dependent. These are known as the parametric bootstrap and the
non-parametric bootstrap. In the parametric bootstrap, the resampling is based on
an estimated model for $v_t$. As an illustration, suppose this estimated model
is a VAR; in this case, the bootstrap sample for $v_t$ is generated by resampling
from the residuals with replacement, and then solving the model recursively.
While relatively straightforward to implement, this approach is only guaranteed
to deliver the types of gain described above if the assumed model for $v_t$ is
correct. It is this caveat that makes the approach unattractive in the types of
models in Table 1.1 for which the data generation process of $v_t$ is not completely
specified. Therefore, we do not pursue the parametric bootstrap further here.⁹
Instead we focus on the non-parametric bootstrap. This method essentially
involves sampling blocks of adjacent observations from the observed sample
and so is commonly referred to as the “block bootstrap”. These blocks can
be non-overlapping or overlapping.¹⁰ To illustrate the difference, consider the
following example. Suppose that $v_t$ is scalar and we have an observed sample
of four observations, $(v_1, v_2, v_3, v_4)$. Suppose further that it is decided to draw
observations in blocks of two; the question of how to choose the block size
is discussed below. Since the original sample has $T = 4$, it is necessary to
draw two blocks with replacement from the original sample to make up one
particular bootstrap sample. If the non-overlapping scheme is used then there
are only two possible blocks: $(v_1, v_2)$ and $(v_3, v_4)$. If the overlapping scheme is
used then there are three possible blocks: $(v_1, v_2)$, $(v_2, v_3)$ and $(v_3, v_4)$. Both
schemes seem intuitively reasonable. To date, there has been no comparison
of the two in the context of GMM. However, the available evidence in other
contexts suggests that the overlapping scheme is to be preferred.¹¹ Nevertheless,
we consider both below.


9 The interested reader is referred to Andrews (2002b) and the survey in Li and Maddala
(1996).

10 The non-overlapping scheme is proposed by Carlstein (1986), and the overlapping scheme
by Künsch (1989). Each method is sometimes referred to by the name of its proponent.
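
The four-observation illustration can be reproduced in a few lines. The following is only a minimal sketch: the function names are hypothetical and chosen for this illustration, and the observations are represented by labels rather than data.

```python
def nonoverlapping_blocks(sample, length):
    """Disjoint runs of `length` adjacent observations (Carlstein's scheme)."""
    usable = len(sample) - len(sample) % length  # drop a ragged tail, if any
    return [tuple(sample[i:i + length]) for i in range(0, usable, length)]


def overlapping_blocks(sample, length):
    """Every run of `length` adjacent observations (Kunsch's scheme)."""
    return [tuple(sample[i:i + length]) for i in range(len(sample) - length + 1)]


sample = ["v1", "v2", "v3", "v4"]
print(nonoverlapping_blocks(sample, 2))  # two candidate blocks
print(overlapping_blocks(sample, 2))     # three candidate blocks
```

A bootstrap sample is then formed by drawing two of the candidate blocks with replacement and laying them end to end.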

While the previous illustration gives the flavour of the block bootstrap, it
turns out that the construction of the base sample is actually more complicated.
This complication arises because the bootstrap is justified using Edgeworth
expansions and these are only valid for dependent data if the statistics of interest
are functions of sample moments involving the same variables and running over
the same set of observations. If we view the sample in terms of the original
observations $\{v_t\}_{t=1}^{T}$ then this structure is not present. However, if we alter
our perspective on both the sampling unit and the sample size then the
desired structure can be restored. To illustrate, we consider the simple case in
which $k = 1$ and so $S = \Gamma_0 + \Gamma_1 + \Gamma_1'$. In this case the overidentifying
restrictions test and confidence intervals are functions of the three basic sample
statistics: the sample moment condition $T^{-1}\sum_{t=1}^{T} f(v_t,\theta)$, the derivative
matrix $T^{-1}\sum_{t=1}^{T} \partial f(v_t, \theta)/\partial\theta'$ and the sample analog to the long run variance

$$\hat S_T = T^{-1}\sum_{t=1}^{T} f(v_t,\theta) f(v_t,\theta)' + T^{-1}\sum_{t=2}^{T} f(v_t,\theta) f(v_{t-1},\theta)' + T^{-1}\sum_{t=2}^{T} f(v_{t-1},\theta) f(v_t,\theta)' \qquad (8.26)$$

It can be recognized that these three statistics do not collectively have the
desired structure for two reasons. First, the sample moment and its derivative
depend on $v_t$ but the variance depends on $v_t$ and $v_{t-1}$. Secondly, some of the
summations start at $t = 1$ and some at $t = 2$. To solve the first problem, it is
necessary to view the sampling unit as

$$V_t = \begin{pmatrix} v_t \\ v_{t-1} \end{pmatrix}$$

To solve the second problem, the sample is restricted to the observations
$t = 2,3,\ldots,T$. These two amendments together imply that the sample is now viewed
as consisting of $\{V_t; t = 2,\ldots,T\}$.

These ideas extend easily to the general case defined by Assumption 8.2.
The base sample for the bootstrap is $V_B = \{V_t; t = k+1, k+2, \ldots, T\}$ where $V_t$
is the $r(k + 1) \times 1$ vector consisting of the $r \times 1$ vectors $\{v_{t-i}; i = 0,1,\ldots,k\}$ stacked
as follows

$$V_t = \begin{pmatrix} v_t \\ v_{t-1} \\ \vdots \\ v_{t-k} \end{pmatrix} \qquad (8.27)$$
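
The stacking in (8.27) amounts to pairing each observation with its $k$ predecessors. A minimal sketch follows, assuming the data are held as a list of $T$ observations, each a list of $r$ numbers; the function name `base_sample` is hypothetical.

```python
def base_sample(v, k):
    """Return [V_{k+1}, ..., V_T], where V_t stacks (v_t, v_{t-1}, ..., v_{t-k}).

    `v` is a list of T observations, each a list of r numbers; v[0] holds v_1.
    """
    stacked = []
    for t in range(k, len(v)):      # zero-based positions of v_{k+1}, ..., v_T
        V_t = []
        for i in range(k + 1):      # lags 0, 1, ..., k
            V_t.extend(v[t - i])
        stacked.append(V_t)
    return stacked


v = [[1.0], [2.0], [3.0], [4.0]]    # T = 4, r = 1
print(base_sample(v, 1))            # T - k = 3 stacked vectors of length r(k+1) = 2
```

Note that the base sample has $T - k$ elements, which is the effective sample size in everything that follows.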


11 See Lahiri (1999). 
It is important to realize that if the bootstrap is to yield the gains described
in the previous sub-section then the GMM estimation must also be based on the
same sample. This means, for instance, that the first and second step estimators
must be calculated respectively as:

$$\hat\theta_{k,T}(1) = \arg\min_{\theta\in\Theta}\ g_{k,T}[\theta]'\, W_T\, g_{k,T}[\theta] \qquad (8.28)$$

$$\hat\theta_{k,T}(2) = \arg\min_{\theta\in\Theta}\ g_{k,T}[\theta]'\, \{\hat S_{k,T}[\hat\theta_{k,T}(1)]\}^{-1} g_{k,T}[\theta] \qquad (8.29)$$

where $g_{k,T}[\theta] = (T - k)^{-1}\sum_{t=k+1}^{T} f(v_t, \theta)$,

$$\hat S_{k,T}[\theta] = \hat\Gamma_{0,(k,T)}(\theta) + \sum_{i=1}^{k}\{\hat\Gamma_{i,(k,T)}(\theta) + \hat\Gamma_{i,(k,T)}(\theta)'\} \qquad (8.30)$$

and $\hat\Gamma_{i,(k,T)}(\theta) = (T - k)^{-1}\sum_{t=k+1}^{T} f(v_t, \theta) f(v_{t-i}, \theta)'$. Notice that the required
structure for the bootstrap necessitates the use of the “truncated” covariance
matrix estimator.¹² In comparison to the original definitions of these estimators,
it is clear that there is some loss of information associated with taking this
approach because, for example, the contribution of $\{f(v_t,\theta); t = 1,2,\ldots,k\}$ to
$\sum_t f(v_t, \theta)$ is lost. However, this is unavoidable because their retention would
lead to a statistic that deviates from the required structure by terms of $O_p(T^{-1})$
and this would negate the anticipated gain from the use of the bootstrap.
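
As a rough illustration of the truncated sample quantities, the sketch below computes the sample moment over $t = k+1,\ldots,T$ and the estimator in (8.30) from a list of moment function values already evaluated at some $\theta$. The function names and the list-of-lists layout are assumptions made for this example only.

```python
def truncated_moment(f, k):
    """Sample moment over t = k+1, ..., T; f[t-1] holds the q-vector f(v_t, theta)."""
    T, q = len(f), len(f[0])
    return [sum(f[t][a] for t in range(k, T)) / (T - k) for a in range(q)]


def gamma_hat(f, k, i):
    """(T-k)^{-1} * sum over t = k+1, ..., T of f_t f_{t-i}' (a q x q matrix)."""
    T, q = len(f), len(f[0])
    return [[sum(f[t][a] * f[t - i][b] for t in range(k, T)) / (T - k)
             for b in range(q)] for a in range(q)]


def truncated_lrv(f, k):
    """Truncated long run variance estimator: Gamma_0 + sum_i (Gamma_i + Gamma_i')."""
    q = len(f[0])
    S = gamma_hat(f, k, 0)
    for i in range(1, k + 1):
        G = gamma_hat(f, k, i)
        for a in range(q):
            for b in range(q):
                S[a][b] += G[a][b] + G[b][a]
    return S


f = [[1.0], [2.0], [3.0], [4.0]]    # scalar moment, T = 4
print(truncated_moment(f, 1), truncated_lrv(f, 1))
```

Every sum runs over the same $T - k$ observations, which is the structure the Edgeworth argument requires.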

Using the base sample in (8.27), we can now present the details of how to 
implement the block bootstrap. Needless to say, the exact details depend on 
the sampling scheme. 


Non-overlapping blocks: The base sample, $V_B$, is divided into $b$ blocks of a
pre-specified length $\ell$. Denote these blocks by $\{B_i; i = 1,2,\ldots,b\}$ where
$B_1 = (V_{k+1}, V_{k+2}, \ldots, V_{k+\ell})$, $B_2 = (V_{k+\ell+1}, V_{k+\ell+2}, \ldots, V_{k+2\ell})$ and so forth. Notice
that this means $T - k = b\ell$; if this is not the case for the desired choice of $\ell$
then additional observations must be dropped from the sample to ensure
conformity. The $n^{th}$ bootstrap sample is constructed by randomly sampling with
replacement $b$ blocks from $\{B_i; i = 1,2,\ldots,b\}$.


Overlapping blocks: Let $I$ denote the set of observations that can begin a block
of $\ell$ observations, that is $I = \{k+1, k+2, \ldots, T-\ell+1\}$. The construction of the
$n^{th}$ bootstrap sample begins with random sampling from $I$ with replacement
$b$ times. If this random sample from $I$ is denoted by $\{i_j; j = 1,2,\ldots,b\}$ then
the $n^{th}$ bootstrap sample consists of the $b$ blocks that begin with the
observations $\{i_j; j = 1,2,\ldots,b\}$. So the first block in the bootstrap sample is
$B_1^{(n)} = (V_{i_1}, V_{i_1+1}, \ldots, V_{i_1+\ell-1})$, the second block is $B_2^{(n)} = (V_{i_2}, V_{i_2+1}, \ldots, V_{i_2+\ell-1})$
and so forth.
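
Both sampling schemes reduce to drawing block starting positions with replacement. The sketch below returns the positions, within the base sample, of the observations that make up one bootstrap sample. It is an illustration under stated assumptions: positions are zero-based (position 0 corresponds to $V_{k+1}$), the function name is hypothetical, and any observations left over after forming $b$ whole blocks are simply dropped.

```python
import random


def block_bootstrap_indices(n_units, block_len, overlapping, rng):
    """Zero-based positions in the base sample forming one bootstrap sample."""
    b = n_units // block_len                      # number of blocks to draw
    if overlapping:
        starts = range(n_units - block_len + 1)   # any position that can begin a block
    else:
        starts = range(0, b * block_len, block_len)
    chosen = [rng.choice(starts) for _ in range(b)]   # sampling with replacement
    return [s + j for s in chosen for j in range(block_len)]


rng = random.Random(12345)                        # fixed seed for reproducibility
print(block_bootstrap_indices(12, 3, overlapping=False, rng=rng))
print(block_bootstrap_indices(12, 3, overlapping=True, rng=rng))
```

Applying the returned positions to the list of stacked vectors gives the bootstrap sample expressed in terms of the sampling unit rather than the blocks.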


12 See Section 3.5.3 for a discussion of its properties.

In our subsequent discussion, it is necessary to express the bootstrap sample in
terms of the sampling unit instead of the blocks. Regardless of the sampling
scheme used, we write the $n^{th}$ bootstrap sample as $V^{(n)} = \{\tilde V_s^{(n)}; s = 1,2,\ldots,\bar T\}$
where $\bar T = T - k$.

To implement this approach, it is clearly necessary to choose the block length
$\ell$. Since the dynamic structure of the data is unknown, it is natural to consider
rules in which the block size increases with $T$, such as $\ell_T = CT^{1/a}$. To date,
there has been significant progress in deducing appropriate choices for $a$ but less
with regard to the selection of $C$. A complicating factor is that the choice of $a$
depends on the statistic of interest. As mentioned above, we focus below on the
use of the bootstrap to reduce the error in both the size of the overidentifying
restrictions test and also in the confidence level of intervals for the unknown
parameters. For these uses, Andrews (2002b) shows the optimal choice of $a$ is
4.¹³ He further shows that this is the optimal choice if the bootstrap is used
to reduce the error in the size of the Wald tests and t-tests. In contrast, if the
objective is to use the bootstrap to estimate the distribution function of the
absolute value of the t-statistic then Hall, Horowitz, and Jing (1995) show that
the optimal choice of $a$ is 5.¹⁴ To date there is no guidance available on the
choice of $C$ for minimizing the error in either the size of the overidentifying
restrictions test or the confidence coverage of intervals based on the GMM
estimator. However, there has been some progress on this issue in other settings;
see Hall, Horowitz, and Jing (1995) and Bühlmann and Künsch (1996).
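
The block length rule is easily evaluated once $a$ and $C$ are fixed. In the sketch below, the default $C = 1$, the rounding to the nearest integer and the floor at one observation are all assumptions of this illustration; as noted above, the text offers no guidance on $C$.

```python
def block_length(T, a, C=1.0):
    """Block length C * T**(1/a), rounded to an integer and never below 1."""
    return max(1, round(C * T ** (1.0 / a)))


# a = 4 targets test size and coverage error; a = 5 targets the distribution
# function of |t| (see the discussion above).
print(block_length(256, 4), block_length(243, 5))
```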


8.1.2.2 Calculation of the GMM Estimator and Related Statistics in 
the Bootstrap Samples 


It can be recalled from Section 8.1.1 that even in the simple example of inference
about a mean in an i.i.d. context, the functional form of the t-statistic differed
in the original and bootstrap samples. Specifically, the bootstrap version of the
t-statistic involves a correction to ensure that it is invariant to whether or not the
null hypothesis holds. A similar modification is necessary in the GMM setting.
However, there is a second problem here. While the block bootstrap seems an
intuitively reasonable method for resampling from dynamic data, it does not yield
samples with identical time series properties to the original data. Fortunately, it is
possible to remedy the situation by the introduction of an additional correction
factor. The exact nature of these corrections is discussed below as they arise
in the sequence of necessary calculations. The presentation here only considers
the case in which inference is based on the second step estimator although, in
principle, all definitions can be modified to accommodate inference based on the
iterated estimator.¹⁵

13 Optimal in the sense that this choice minimizes the error between the nominal and actual
size of the test, and the nominal and actual coverage probability for the confidence interval.

14 Optimal in the sense that it minimizes the mean squared error.

15 Hall and Horowitz (1996) and Andrews (2002b) consider inference based on either the
first-step or second-step estimators.

Recall that the sample is now viewed in terms of the augmented vectors
$\{\tilde V_s\}$ but the sample moment only depends on a sub-vector of $\tilde V_s$. Therefore,
we decompose $\tilde V_s$ to reflect its structure as defined by (8.27), as follows

$$\tilde V_s = \begin{pmatrix} \tilde v_{s,1} \\ \tilde v_{s,2} \\ \vdots \\ \tilde v_{s,k+1} \end{pmatrix} \doteq \begin{pmatrix} v_t \\ v_{t-1} \\ \vdots \\ v_{t-k} \end{pmatrix} \qquad (8.31)$$

where the last identity is heuristic and included to remind the reader of the
structure of $V_t$.

Before we proceed any further, it is necessary to address a matter relating to
the notation. As with $\tilde V_s^{(n)}$, all the statistics calculated in the bootstrap sample
should be indexed by $n$. However, this makes the notation extremely cumbersome
and so we suppress this dependence during this part of the discussion. As
emerges below, the statistics of interest are functions of both the bootstrap and
also the original sample. All statistics calculated from the bootstrap sample are
indicated by a tilde accent (i.e. $\tilde a$); all statistics calculated from the original
sample are indicated by a hat accent (i.e. $\hat a$). No accent indicates that the
statistic in question is a function of both samples.

In the original sample, the GMM estimator is obtained by minimizing a
quadratic form in the sample moment. In the bootstrap sample, this moment
must be centred to ensure that it is zero at $\hat\theta_{k,T}$ and thus mimics the property
of the population moment which is zero at $\theta_0$. This centred version of the
bootstrap sample moment is calculated as:

$$\tilde g_{\bar T}[\theta; \hat\theta_{k,T}] = \bar T^{-1}\sum_{s=1}^{\bar T} f_c(\tilde v_{s,1}, \theta; \hat\theta_{k,T}) \qquad (8.32)$$

where

$$f_c(v, \theta; \hat\theta_{k,T}) = f(v, \theta) - m_T(\hat\theta_{k,T}) \qquad (8.33)$$

Here the $c$ subscript stands for “centred”, and $m_T(\cdot)$ is calculated from the
original sample but the exact formula depends on the sampling scheme. If the
non-overlapping scheme is used then

$$m_T(\theta) = g_{k,T}[\theta] \qquad (8.34)$$

where $g_{k,T}[\theta]$ is defined under (8.29) above. If the overlapping scheme is used
then

$$m_T(\theta) = (T - k - \ell + 1)^{-1}\sum_{t=k+1}^{T} w(t) f(v_t, \theta) \qquad (8.35)$$

where

$$w(t) = \begin{cases} (t - k)/\ell & \text{if } t \in [k+1, \ell+k-1] \\ 1 & \text{if } t \in [\ell+k, T-\ell+1] \\ (T - t + 1)/\ell & \text{if } t \in [T-\ell+2, T] \end{cases} \qquad (8.36)$$
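
The weights in (8.36) reflect how often each observation can appear in an overlapping block: observations near either end of the base sample belong to fewer than $\ell$ blocks. The sketch below, for a scalar moment and with hypothetical function names, computes the weighted average in (8.35) and compares it with the average of the means of all admissible blocks, which is the expected value of the bootstrap sample moment under the overlapping scheme. The comparison assumes the last case of (8.36) is $(T - t + 1)/\ell$ and that $T - k \geq 2\ell - 1$.

```python
def overlap_weight(t, T, k, ell):
    """w(t) for t = k+1, ..., T (one-based time index)."""
    if t <= ell + k - 1:
        return (t - k) / ell
    if t <= T - ell + 1:
        return 1.0
    return (T - t + 1) / ell


def centring_term_overlapping(f, k, ell):
    """Weighted mean of a scalar moment; f[t-1] holds f(v_t, theta)."""
    T = len(f)
    total = sum(overlap_weight(t, T, k, ell) * f[t - 1] for t in range(k + 1, T + 1))
    return total / (T - k - ell + 1)


def mean_of_block_means(f, k, ell):
    """Average, over all admissible overlapping blocks, of the block mean."""
    T = len(f)
    starts = range(k + 1, T - ell + 2)            # t = k+1, ..., T-ell+1
    means = [sum(f[s - 1 + j] for j in range(ell)) / ell for s in starts]
    return sum(means) / len(means)


f = [float(x * x) for x in range(1, 11)]          # T = 10 illustrative values
print(centring_term_overlapping(f, 1, 3), mean_of_block_means(f, 1, 3))
```

The two numbers agree, which is why centring at $m_T(\cdot)$ makes the bootstrap moment have mean zero under the bootstrap distribution.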
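The first step GMM estimator in the bootstrap sample is calculated as follows:

$$\tilde\theta_{\bar T}(1) = \arg\min_{\theta\in\Theta}\ \tilde g_{\bar T}[\theta; \hat\theta_{k,T}(1)]'\, W_T\, \tilde g_{\bar T}[\theta; \hat\theta_{k,T}(1)] \qquad (8.37)$$

Notice that on this first step, the bootstrap sample moment is centred using
$m_T(\cdot)$ evaluated at the first step GMM estimator defined in (8.28). The second
step GMM estimator in the bootstrap sample is calculated as:

$$\tilde\theta_{\bar T}(2) = \arg\min_{\theta\in\Theta}\ \tilde g_{\bar T}[\theta; \hat\theta_{k,T}(2)]'\, \{\tilde S_{\bar T}[\tilde\theta_{\bar T}(1); \hat\theta_{k,T}(1)]\}^{-1}\, \tilde g_{\bar T}[\theta; \hat\theta_{k,T}(2)] \qquad (8.38)$$

where

$$\tilde S_{\bar T}[\theta; \bar\theta] = \tilde\Gamma_0(\theta; \bar\theta) + \sum_{i=1}^{k}\{\tilde\Gamma_i(\theta; \bar\theta) + \tilde\Gamma_i(\theta; \bar\theta)'\}$$

$$\tilde\Gamma_i(\theta; \bar\theta) = \bar T^{-1}\sum_{s=1}^{\bar T} f_c(\tilde v_{s,1}, \theta; \bar\theta)\, f_c(\tilde v_{s,i+1}, \theta; \bar\theta)'$$

and $f_c(\cdot)$ is defined in (8.33). Two aspects of the second step minimand are
worth noting. First, the bootstrap sample moment is centred using $m_T(\cdot)$
evaluated at the second step estimator defined in (8.29). Secondly, the long run
variance estimator is calculated using the centred sample moment.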

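Recall that we consider here only two statistics associated with the second
step estimator: the overidentifying restrictions test and a confidence interval for
$\theta_{0,i}$. The distribution of the overidentifying restrictions test is approximated
using the following statistic calculated in the bootstrap samples:

$$\tilde J_{\bar T} = \tilde H_{\bar T}[\tilde\theta_{\bar T}(2)] \qquad (8.39)$$

where¹⁶

$$\tilde H_{\bar T}[\theta] = \bar T\, \tilde g_{\bar T}[\theta; \hat\theta_{k,T}(2)]'\, \{\tilde S_{\bar T}[\theta; \hat\theta_{k,T}(2)]\}^{-1/2} A_{\bar T}^{+} \{\tilde S_{\bar T}[\theta; \hat\theta_{k,T}(2)]\}^{-1/2}\, \tilde g_{\bar T}[\theta; \hat\theta_{k,T}(2)] \qquad (8.40)$$

In the previous equation, $A_{\bar T}^{+}$ denotes the Moore-Penrose generalized inverse
of the matrix $A_{\bar T}$ that is calculated as follows:

$$A_{\bar T} = \hat M_{k,T}\, \hat S_{k,T}^{-1/2} B_{\bar T}\, \hat S_{k,T}^{-1/2} \hat M_{k,T} \qquad (8.41)$$

where

$$\hat M_{k,T} = I_q - \hat S_{k,T}^{-1/2} \hat G_{k,T} \hat C_{k,T} \hat G_{k,T}' \hat S_{k,T}^{-1/2} \qquad (8.42)$$

$$\hat C_{k,T} = (\hat G_{k,T}' \hat S_{k,T}^{-1} \hat G_{k,T})^{-1} \qquad (8.43)$$

$$\hat S_{k,T} = \hat S_{k,T}[\hat\theta_{k,T}(2)] \qquad (8.44)$$

$$\hat G_{k,T} = \hat G_{k,T}[\hat\theta_{k,T}(2)] \qquad (8.45)$$

for $\hat S_{k,T}[\theta]$ defined in (8.30), and

$$\hat G_{k,T}[\theta] = \bar T^{-1}\sum_{t=k+1}^{T} \partial f(v_t, \theta)/\partial\theta' \qquad (8.46)$$


16 Following the practice in this literature, $Z^{-1/2}$ denotes the symmetric square root of
$Z^{-1}$ for any nonsingular symmetric real matrix $Z$; that is, if the spectral decomposition of $Z$
is $Z_1 Z_2 Z_1'$ then $Z^{-1/2} = Z_1 Z_2^{-1/2} Z_1'$.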

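The last component of $A_{\bar T}$ is a matrix $B_{\bar T}$ whose calculation depends on the
sampling scheme employed. If the non-overlapping scheme is used then:

$$B_{\bar T} = \bar T^{-1}\sum_{i=0}^{b-1}\sum_{j=1}^{\ell}\sum_{m=1}^{\ell} h_T(i\ell + j + k)\, h_T(i\ell + m + k)' \qquad (8.47)$$

where

$$h_T(t) = f(v_t, \hat\theta_{k,T}(2)) - m_T(\hat\theta_{k,T}(2))$$

If the overlapping scheme is used then:

$$B_{\bar T} = b\,\bar T^{-1}(\bar T - \ell + 1)^{-1}\sum_{i=0}^{\bar T-\ell}\sum_{j=1}^{\ell}\sum_{m=1}^{\ell} h_T(i + j + k)\, h_T(i + m + k)' \qquad (8.48)$$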
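
The double sums over $j$ and $m$ in (8.47) and (8.48) collapse to the outer product of each block sum with itself, so the matrix is a scaled average of squared block sums. The sketch below illustrates this for a scalar moment. It rests on reading the scale factors as $\bar T^{-1}$ for non-overlapping blocks and $b\,\bar T^{-1}(\bar T - \ell + 1)^{-1}$ for overlapping blocks; the function name and the zero-based layout of the centred values are assumptions of this example.

```python
def b_scalar(h, ell, overlapping):
    """Scaled sum of squared block sums for a scalar centred moment.

    h[0], h[1], ... hold h_T(k+1), h_T(k+2), ...; ell is the block length.
    """
    n = len(h)                                     # effective sample size T - k
    b = n // ell                                   # number of blocks per bootstrap sample
    if overlapping:
        starts = range(n - ell + 1)
        scale = b / (n * (n - ell + 1))
    else:
        starts = range(0, b * ell, ell)
        scale = 1.0 / n
    return scale * sum(sum(h[s:s + ell]) ** 2 for s in starts)


h = [1.0, -1.0, 2.0, -2.0]
print(b_scalar(h, 2, overlapping=False), b_scalar(h, 2, overlapping=True))
```

In this small example the non-overlapping block sums are both zero, while the overlapping scheme picks up the middle block $(-1, 2)$; the two schemes can therefore imply different bootstrap variances for the same data.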


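The bootstrap version of the confidence interval is based on approximating the
distribution of the absolute value of the t-ratio by

$$\tilde\tau_{\bar T,i} = c_i\, \frac{\bar T^{1/2}\, |\tilde\theta_{\bar T,i}(2) - \hat\theta_{k,T,i}(2)|}{\sqrt{\{\tilde V_{\bar T}\}_{ii}}} \qquad (8.49)$$

where $\hat\theta_{k,T,i}(2)$ is the $i^{th}$ element of $\hat\theta_{k,T}(2)$, $\{\cdot\}_{ii}$ denotes the $i$-$i^{th}$ diagonal
element of the matrix in parentheses, and $\tilde V_{\bar T}$ is given by

$$\tilde V_{\bar T} = \left[\tilde G_{\bar T}[\tilde\theta_{\bar T}(2)]'\, \{\tilde S_{\bar T}[\tilde\theta_{\bar T}(2); \hat\theta_{k,T}(2)]\}^{-1}\, \tilde G_{\bar T}[\tilde\theta_{\bar T}(2)]\right]^{-1} \qquad (8.50)$$

and

$$\tilde G_{\bar T}[\theta] = \bar T^{-1}\sum_{s=1}^{\bar T} \partial f(\tilde v_{s,1}, \theta)/\partial\theta' \qquad (8.51)$$

The correction factor $c_i$ is defined as

$$c_i = \left(\{\hat C_{k,T}\}_{ii} / \{D_{\bar T}\}_{ii}\right)^{1/2} \qquad (8.52)$$

where $\hat C_{k,T}$ is defined above in (8.43), and

$$D_{\bar T} = \hat C_{k,T} \hat G_{k,T}' \hat S_{k,T}^{-1} B_{\bar T}\, \hat S_{k,T}^{-1} \hat G_{k,T} \hat C_{k,T} \qquad (8.53)$$

The bootstrap percentile is based on the absolute value because the objective
here is to calculate a symmetric confidence interval for $\theta_{0,i}$, that is of the generic
form $\hat\theta_{k,T,i}(2) \pm \eta$ for some $\eta > 0$.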
Inspection of (8.39)-(8.40) and (8.49) reveals that the bootstrap versions of
the overidentifying restrictions statistic and t-ratio have two types of
correction relative to their sample counterparts. First, each statistic has a “centring”
correction similar in spirit to the correction required to the t-statistic in our
motivating example: the sample moment is centred in the overidentifying
restrictions test, and the t-ratio is centred using the corresponding GMM estimator from
the original sample. However, in this more general setting, it is also necessary
to make a second correction and this leads to the presence of $A_{\bar T}^{+}$ in (8.40) and $c_i$
in (8.49). These additional corrections are needed because the block bootstrap
does not adequately replicate the time series properties of the original data.
The problem stems from the long run variance. In the original population, this
variance only involves terms of the form $E[f(v_s,\theta_0)f(v_t,\theta_0)']$ for $|s - t| \leq k$.
However, in the population generated from the bootstrap distribution, this long
run variance involves terms of the form $E^B[f(v_s,\theta_0)f(v_t,\theta_0)']$ for all $s,t$ within
the same block where $E^B[\cdot]$ denotes expectations relative to the bootstrap
distribution.¹⁷ Note that this second correction is only needed because the block
bootstrap is used in an attempt to take account of the dynamic dependence
structure in the data. If the data are independent then there is no need to use
the block bootstrap and so the problem goes away. It is for this reason that this
second correction is unnecessary in our motivating example in Section 8.1.1.

Bootstrap methods are inherently computationally burdensome because of
the resampling. However, in our setting, the burden is potentially far greater
because it is necessary to perform two numerical optimizations for each
bootstrap sample. Fortunately, it is not necessary to iterate gradient methods until
convergence within the numerical optimization in order to gain the asymptotic
refinements associated with the bootstrap.¹⁸ Davidson and MacKinnon (1999)
present a heuristic justification for this statement and Andrews (2002b)
subsequently provides a rigorous demonstration. For our purposes here, it suffices to
concentrate on the heuristic argument. To illustrate, we focus attention on the
Newton-Raphson method of optimization in which the estimator is updated
on the $i^{th}$ step of the numerical optimization according to

$$\theta(i) = \theta(i-1) - \left[\frac{\partial^2 Q_T(\theta(i-1))}{\partial\theta\,\partial\theta'}\right]^{-1} \frac{\partial Q_T(\theta(i-1))}{\partial\theta}$$


Typically, this updating is continued until convergence to give the estimator $\hat\theta_T$.
However, for our purposes here, it is important to consider the way in which
this convergence occurs. Robinson (1988) shows that if $\theta(0) - \hat\theta_T = O_p(T^{-1/2})$
then $\theta(i) - \hat\theta_T = O_p(T^{-2^{i-1}})$. Using this property, Davidson and MacKinnon
(1999) make the important observation that it is only necessary to iterate two
steps before the difference between $\theta(i)$ and $\hat\theta_T$ is of smaller order than the
asymptotic refinements associated with the bootstrap. Therefore, it suffices to use
the Newton-Raphson method with only two iterations in the numerical
optimizations within the bootstrap samples. Needless to say, the specifics depend
on the exact method of numerical optimization. If the Gauss-Newton method
is used then at least three steps are needed in the numerical optimization; see
Davidson and MacKinnon (1999) or Andrews (2002b). Davidson and MacKinnon
(1999) introduce the term “approximate bootstrap” to describe the generic
strategy of fixing the number of iterations within the numerical optimization in
the bootstrap samples. To implement the approximate bootstrap, it is
necessary to have a suitable starting value for the optimization. The natural choice
is the corresponding GMM estimator from the observed sample, and Andrews
(2002b) verifies that this choice is appropriate.


17 See Hall and Horowitz (1996) for further discussion.
18 See Section 3.2 for a discussion of gradient methods.
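
The rapid convergence that underlies the approximate bootstrap can be seen in a toy problem. The sketch below applies a fixed number of Newton-Raphson steps to the scalar objective $Q(\theta) = e^{\theta} - 2\theta$, whose minimizer is $\ln 2$. This objective is chosen purely for illustration and is not a GMM minimand; the function name is hypothetical.

```python
import math


def newton_steps(grad, hess, start, n_steps):
    """Apply a fixed number of Newton-Raphson updates to a scalar parameter."""
    theta = start
    for _ in range(n_steps):
        theta = theta - grad(theta) / hess(theta)
    return theta


grad = lambda th: math.exp(th) - 2.0     # dQ/dtheta
hess = lambda th: math.exp(th)           # d2Q/dtheta2
start = math.log(2.0) + 0.1              # starting value close to the minimizer

for steps in (0, 1, 2):
    err = abs(newton_steps(grad, hess, start, steps) - math.log(2.0))
    print(steps, err)                    # the error roughly squares at each step
```

In the bootstrap setting the starting value would be the estimator from the observed sample, which plays the role of `start` here.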

Clearly the approximate bootstrap has the potential to reduce the compu- 
tational burden considerably. However, two caveats need to be borne in mind. 
First, the results above only yield a minimum number of iterations. Intuition 
suggests that the results can never be worse if more iterations are used within the 
numerical optimization routine. To date, there is no evidence on how the number 
of iterations affects the accuracy of subsequent inferences in overidentified
nonlinear dynamic models. Secondly, the available results only cover variants of the
Newton-Raphson and Gauss-Newton routines. It is an open question whether
the approximate bootstrap can be extended to other gradient methods.


8.1.2.3 Choosing the Number of Replications 


It can be recalled that the motivation for the bootstrap derives from its ability 
to provide asymptotic refinements to inference. However, the argument is based 
on a consideration of what is often termed the “ideal bootstrap” in which the 
number of replications, N, tends to infinity. Clearly this version of the bootstrap 
is infeasible, and so we now consider a three-step method for the selection of N 
proposed by Andrews and Buchinsky (2000). The precise details of this method 
are application specific. Following our practice above, we limit our discussion 
to the two statistics of interest, the overidentifying restrictions test and a
two-sided symmetric confidence interval for elements of $\theta_0$. For the former, we focus
purely on using the bootstrap to calculate the critical point for a given level of
significance. Therefore, in both cases, the bootstrap is being used to calculate a
pre-specified percentile of a distribution. It should be noted that our discussion
is specific to this precise context. The details of the method would be different
if, for example, the bootstrap is used to calculate the p-value of the test or if it
is used to calculate two-sided equal-tailed confidence intervals.¹⁹

19 Andrews and Buchinsky (2000) consider both these cases along with many others of
empirical relevance.

Since both our cases of interest involve the use of the bootstrap to
calculate a particular percentile, we abstract to a generic notation to simplify the
presentation. Accordingly, let $\Xi_T$ denote the statistic calculated in the original
sample and $\tilde\Xi^{(n)}$ denote the statistic calculated in the $n^{th}$ bootstrap sample.
Let $p_T$ denote the true $100(1-\alpha)\%$ percentile of $\Xi_T$, $\hat p_{T,\infty}$ be the corresponding
percentile based on the ideal bootstrap and $\hat p_{T,N}$ be the same statistic based
on the bootstrap with $N$ replications. It turns out to be convenient to restrict
attention to choices of $N$ that satisfy:

$$\frac{\nu}{N+1} = 1 - \alpha \qquad (8.54)$$

where $\nu$ is some positive integer, because then $\hat p_{T,N}$ is the $\nu^{th}$ order statistic of
the bootstrap sample $\{\tilde\Xi^{(n)}; n = 1,2,\ldots,N\}$. However, it should be noted that
this is a non-trivial restriction: for example, if $\alpha = 0.05$ then the set of possible
values for $N$ is $\{20h - 1; h = 1,2,3,\ldots\}$; see Andrews and Buchinsky (2000) for
further details.
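
The restriction in (8.54) and the resulting order-statistic percentile can be sketched as follows. The function names are hypothetical, and the significance level is converted to an exact fraction so that the integer check on $\nu$ is not disturbed by floating point rounding.

```python
from fractions import Fraction


def admissible_N(alpha, h):
    """N = a2*h - 1, where alpha = a1/a2 in lowest terms, so (N+1)(1-alpha) is an integer."""
    return Fraction(str(alpha)).denominator * h - 1


def bootstrap_percentile(stats, alpha):
    """The nu-th order statistic of the bootstrap statistics, nu = (N+1)(1-alpha)."""
    N = len(stats)
    nu = (N + 1) * (1 - Fraction(str(alpha)))
    if nu.denominator != 1:
        raise ValueError("N does not satisfy nu/(N+1) = 1 - alpha for integer nu")
    return sorted(stats)[int(nu) - 1]


print([admissible_N(0.05, h) for h in (1, 2, 3)])      # 19, 39, 59
print(bootstrap_percentile(list(range(1, 20)), 0.05))  # 19th order statistic of 19 values
```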

Since the bootstrap is motivated by the properties of the ideal bootstrap, it
is natural to base the selection rule for $N$ upon some measure of the distance
between $\hat p_{T,N}$ and $\hat p_{T,\infty}$. In Andrews and Buchinsky’s (2000) three-step method,
this distance is captured by the percentage deviation of $\hat p_{T,N}$ from $\hat p_{T,\infty}$, that is

$$100\, \frac{|\hat p_{T,N} - \hat p_{T,\infty}|}{\hat p_{T,\infty}}$$

It can be recalled from Section 8.1.1 that the bootstrap percentiles are random
because they are conditional on the observed sample. Therefore, any statements
about the distance between the percentiles must have a probability attached,
and so take the form

$$P^B\left[100\, \frac{|\hat p_{T,N} - \hat p_{T,\infty}|}{\hat p_{T,\infty}} \leq pdb\right] = 1 - \beta \qquad (8.55)$$

where, once again, $P^B[\cdot]$ denotes probability based on the bootstrap
distribution. The three-step method provides a rule for selecting $N$ for prespecified
values of the percentage deviation bound, $pdb$, and the probability $1 - \beta$. Although
it does not serve our purposes to explore the theoretical underpinnings of
the method here, it is worth noting that it derives from the following limiting
result:

$$N^{1/2}\left(\frac{\hat p_{T,N} - \hat p_{T,\infty}}{\hat p_{T,\infty}}\right) \xrightarrow{d} N(0, \omega) \qquad (8.56)$$

where $\omega$ is application specific. The method involves the construction of
consistent estimates for $\omega$ that can be used in conjunction with the distributional
result in (8.56) to deduce a value for $N$ for which (8.55) is satisfied for a given
$pdb$ and $1 - \beta$. The details are as follows.
Andrews and Buchinsky’s (2000) three-step method for the selection of N


Define $a_1$ and $a_2$ such that $\alpha = a_1/a_2$ where $a_1$ and $a_2$ are positive integers
with no common divisors.²⁰ It is also necessary to specify $(pdb, \beta)$.


Step 1: Calculate an initial value for $N$ as follows: $N_1 = a_2 h_1 - 1$ where
$h_1 = \mathrm{int}[10{,}000\, z^2_{1-\beta/2}\, \omega_1/(pdb^2 a_2)]$, $\omega_1$ is an initial estimator of $\omega$ given
on a case by case basis in Tables 8.1-8.2, $z_\beta$ is the $100\beta\%$ percentile of the
standard normal distribution, and $\mathrm{int}[\cdot]$ denotes the integer part.


Step 2: Simulate $N_1$ bootstrap statistics $\{\tilde\Xi^{(n)}; n = 1,2,\ldots,N_1\}$ and compute
the updated estimator of $\omega$, $\omega_2$, the formula for which is given in Tables
8.1-8.2 on a case by case basis.


Step 3: Calculate an updated estimator of $N$ as follows: $N_2 = a_2 h_2 - 1$ where
$h_2 = \mathrm{int}[10{,}000\, z^2_{1-\beta/2}\, \omega_2/(pdb^2 a_2)]$.


The selected number of replications is $N^* = \max\{N_1, N_2\}$.


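Table 8.1
$\omega_1$ and $\omega_2$ in the calculation of the $100\alpha\%$ critical value of the
overidentifying restrictions test


$\omega_i$:
  $\omega_1 = \dfrac{\alpha(1-\alpha)}{p_{1-\alpha}^2\, g(p_{1-\alpha})^2}$
Quantities used in $\omega_i$:
  $g(x) = \dfrac{x^{d-1} e^{-x/2}}{\Gamma(d)\, 2^{d}}$, $d = (q - p)/2$
  $p_{1-\alpha}$ = the $100(1 - \alpha)\%$ percentile of the $\chi^2_{q-p}$ distribution

$\omega_i$:
  $\omega_2 = \dfrac{\alpha(1-\alpha)}{\hat p_{1-\alpha}^2} \left\{ \dfrac{N_1}{2\hat m} \left( J^*_{\nu+\hat m} - J^*_{\nu-\hat m} \right) \right\}^2$
Quantities used in $\omega_i$:
  $J^*_j$ = the $j^{th}$ order statistic of $\{\tilde J^{(n)}_{\bar T}; n = 1,2,\ldots,N_1\}$
  $\nu = (N_1 + 1)(1 - \alpha)$
  $\hat m = \mathrm{int}[c_\alpha N_1^{2/3}]$
  $\hat p_{1-\alpha} = J^*_\nu$
  $c_\alpha = \left( \dfrac{1.5\, z^2_{1-\alpha/2}\, g(p_{1-\alpha})^4}{3 g'(p_{1-\alpha})^2 - g(p_{1-\alpha})\, g''(p_{1-\alpha})} \right)^{1/3}$
  $g'(x) = g(x)\{(d - 1)x^{-1} - 0.5\}$
  $g''(x) = g'(x)\{(d - 1)x^{-1} - 0.5\} - g(x)(d - 1)x^{-2}$


Notes: In this context, $\Gamma(\cdot)$ denotes the “gamma function” and is not to be confused with our
earlier use of this symbol for an autocovariance matrix. It is defined as follows: $\Gamma(0.5) = \sqrt{\pi}$;
if $a$ is an even integer then $\Gamma(a/2) = (a/2 - 1) \cdots 3 \cdot 2 \cdot 1$; if $a$ is an odd integer then
$\Gamma(a/2) = (a/2 - 1) \cdots \tfrac{3}{2} \cdot \tfrac{1}{2} \cdot \sqrt{\pi}$; e.g. see Hamilton (1994) [p.355].

20 So, for example, if $\alpha = 0.10, 0.05, 0.01$ then $(a_1, a_2)$ are respectively
$(1, 10)$, $(1, 20)$, $(1, 100)$.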
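Table 8.2
$\omega_1$ and $\omega_2$ in the calculation of the $100(1-\alpha)\%$ symmetric
confidence interval for $\theta_{0,i}$


$\omega_i$ and quantities used in $\omega_i$

$\omega_1 = \dfrac{\alpha(1-\alpha)}{z^2_{1-\alpha/2}\,[2\phi(z_{1-\alpha/2})]^2}$
    $\phi(z) = (2\pi)^{-1/2}e^{-z^2/2}$

$\omega_2 = \dfrac{\alpha(1-\alpha)}{\hat p_{1-\alpha}^2}\left\{\dfrac{N_1}{2m}\left(\hat a^*_{T,i,(\nu+m)} - \hat a^*_{T,i,(\nu-m)}\right)\right\}^2$
    $\hat a^*_{T,i,(j)}$ = the $j$th order statistic of $\{\hat a^*_{T,i,n};\ n = 1,2,\ldots,N_1\}$
    $\nu = (N_1+1)(1-\alpha)$
    $m = \mathrm{int}[c_\alpha N_1^{2/3}]$
    $\hat p_{1-\alpha} = \hat a^*_{T,i,(\nu)}$
    $c_\alpha = \left\{\dfrac{6\, z^2_{1-\alpha/2}\,\phi(z_{1-\alpha/2})^2}{2z^2_{1-\alpha/2} + 1}\right\}^{1/3}$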
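
Steps 2 and 3 for the symmetric confidence interval can be sketched in code. The following Python fragment is a minimal illustration under stated assumptions, not the routine used for the results in this chapter: the function name and its arguments are hypothetical, draws of |N(0,1)| stand in for the absolute bootstrap t-statistics, and pdb is assumed to be entered in percentage points. It computes $\nu$, $m$, $\omega_2$ and $N_2$ from the definitions in Table 8.2.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def updated_replications(abs_t_boot, a1, a2, pdb_percent, z_beta):
    """Hypothetical helper: omega_2 and N_2 for a symmetric confidence interval.

    abs_t_boot : absolute bootstrap t-statistics from the N_1 initial samples
    a1, a2     : integers with alpha = a1 / a2
    pdb_percent: percentage deviation bound, assumed to be in percentage points
    z_beta     : normal percentile used in h_2 (an assumed input)
    """
    n1 = len(abs_t_boot)
    alpha = a1 / a2
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    c_alpha = (6 * z ** 2 * STD_NORMAL.pdf(z) ** 2 / (2 * z ** 2 + 1)) ** (1 / 3)
    nu = (n1 + 1) * (a2 - a1) // a2        # (N_1 + 1)(1 - alpha), an integer when N_1 = a2*h1 - 1
    m = int(c_alpha * n1 ** (2 / 3))
    order = sorted(abs_t_boot)             # order[j - 1] is the j-th order statistic
    p_hat = order[nu - 1]
    # (N_1 / 2m) times the spacing of order statistics estimates 1 / density at the percentile
    inv_density = n1 / (2 * m) * (order[nu + m - 1] - order[nu - m - 1])
    omega2 = alpha * (1 - alpha) / p_hat ** 2 * inv_density ** 2
    h2 = int(10_000 * z_beta ** 2 * omega2 / (pdb_percent ** 2 * a2))
    return nu, m, omega2, a2 * h2 - 1


# Stand-in data: 1379 draws of |N(0,1)| in place of bootstrap t-statistics.
rng = random.Random(0)
draws = [abs(rng.gauss(0.0, 1.0)) for _ in range(1379)]
nu, m, omega2, n2 = updated_replications(draws, 1, 20, 5.0, STD_NORMAL.inv_cdf(0.975))
print(nu, m, round(omega2, 3), n2)
```

Because $N_1 = a_2h_1 - 1$, the index $\nu = (N_1+1)(1-\alpha)$ is always an integer, which is why integer division is safe in the sketch.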


8.1.2.4 Summary of Bootstrap Calculations 


Pulling together the previous discussion, it can be seen that there are four major 
steps to implementing the bootstrap in the context here. 


Step 1: Re-estimate the model using the truncated original sample as in (8.28)— 
(8.29). 


Step 2: Generate $N_1$ bootstrap samples in the way described in Section 8.1.2.1 for
$N_1$ defined in Andrews and Buchinsky’s (2000) three-step procedure
in Section 8.1.2.3.


Step 3: Calculate the bootstrap distributions for the statistic of interest as
described in Section 8.1.2.2 and compute $N_2$, and hence $N^*$, defined in
Andrews and Buchinsky’s (2000) three-step procedure in Section 8.1.2.3.


Step 4: Generate $N^*$ bootstrap samples as in Section 8.1.2.1 and calculate the
required statistics of interest as in Section 8.1.2.2.$^{21}$


Inference is then conducted as follows. For the overidentifying restrictions test
the decision rule is to reject $H_0: E[f(v_t,\theta_0)] = 0$ at the $100\alpha\%$ significance level
if:

$J_T > J^*_{(\nu)}$ (8.57)


21 Note that this step does not involve additional calculations if $N^* = N_1$ and only the
generation of an additional $N_2 - N_1$ bootstrap samples if $N^* = N_2$.
where $J^*_{(\nu)}$ is defined in Table 8.1. This decision rule yields an approximate
$100\alpha\%$ test because the true size of the test is given by:


$P(J_T > J^*_{(\nu)} \mid H_0) = \alpha + o(T^{-1})$


The $100(1-\alpha)\%$ bootstrap confidence interval for $\theta_{0,i}$ is given by


$\hat\theta_{T,i} \pm \hat a^*_{T,i}\sqrt{\hat V_{ii}/(T-k)}$ (8.58)


where $\hat a^*_{T,i}$ is defined in Table 8.2, $\hat V_{ii}$ is the $(i,i)$th element of $\hat V$ where


$\hat V = (G_T'S_T^{-1}G_T)^{-1}$


and the component matrices are defined in (8.44)-(8.45). Once again, the
attached probability statement is only approximate as the true coverage
probability is:


$P\left(\hat\theta_{T,i} - \hat a^*_{T,i}\sqrt{\hat V_{ii}/(T-k)} \le \theta_{0,i} \le \hat\theta_{T,i} + \hat a^*_{T,i}\sqrt{\hat V_{ii}/(T-k)}\right) = 1 - \alpha + o(T^{-1})$


We now illustrate these calculations using our running empirical example. 


Example: Hansen and Singleton’s (1982) Consumption Based Asset 
Pricing Model 

We use the bootstrap to calculate the critical value associated with perform- 
ing the overidentifying restrictions test at the 5% significance level, and the 
percentiles needed to construct 95% symmetric confidence intervals for the pa- 
rameters. Our description of the necessary calculations follows the four-step 
sequence described above. For this particular model Step 1 is redundant be- 
cause k = 0. To implement Step 2, it is necessary to determine the sampling 
unit and sample size for the bootstrap sample. It can be recalled for this model


$f(v_t,\theta) = z_t(\delta x_{1,t+1}^{\gamma-1}x_{2,t+1} - 1)$

where $x_{1,t+1} = c_{t+1}/c_t$, $x_{2,t+1} = r_{t+1}/p_t$ and $z_t = (1, x_{1,t}, x_{2,t}, x_{1,t-1}, x_{2,t-1})'$.
Therefore the sampling unit for the bootstrap sample is


$(x_{1,t+1},\ x_{2,t+1},\ x_{1,t},\ x_{2,t},\ x_{1,t-1},\ x_{2,t-1})'$
Since k = 0, the bootstrap sample size is the same as the original sample size,
that is, T = 465. We next turn to the block size, $\ell$. It can be recalled
from Section 8.1.2.1 that the optimal block size depends on the statistic
being calculated. Since we calculate a fixed percentile of a distribution,
the optimal block size is $O(T^{1/4})$. For this example, the sample size is 465,
and so $T^{1/4} = 4.6437$. Obviously, the block size must be an integer and so
we fix $\ell = 5$. This is a convenient choice because it means there are exactly
ninety-three non-overlapping blocks in the sample. To calculate the number of
replications, we set pdb = 0.05 and $\beta = 0.95$. The resulting choice of $N_1$ is 2379
for the overidentifying restrictions test and 1379 for the confidence intervals.
Therefore, for simplicity, we set $N_1 = 2379$ for both types of statistics.
Parenthetically, it should be noted that in a very small number of the bootstrap
samples the covariance matrix estimator $S_T$, evaluated at the second-step
estimates, was singular. Therefore the number of bootstrap samples
was actually set equal to $N_1 + 50$ and the calculations were based on the first
$N_1$ bootstrap samples for which the calculation of the required statistics was
successful.
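
The arithmetic behind the initial number of replications can be checked with a short calculation. The following is a minimal Python sketch, not the code used for the example: the function names are hypothetical, pdb is assumed to be entered in percentage points (5 rather than 0.05), and the normal percentile entering $h_1$ is assumed to be the 97.5th. Under these assumed conventions, and with $\omega_1$ taken from Table 8.2, the calculation returns 1379, the value quoted above for the confidence intervals.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def omega1_symmetric_ci(alpha):
    """omega_1 for a symmetric confidence interval (Table 8.2):
    alpha(1 - alpha) / (z^2 [2 phi(z)]^2), with z the 100(1 - alpha/2)th normal percentile."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return alpha * (1 - alpha) / (z ** 2 * (2 * STD_NORMAL.pdf(z)) ** 2)


def initial_replications(a2, pdb_percent, z, omega1):
    """Step 1: N_1 = a2 * h1 - 1 with h1 = int[10,000 z^2 omega1 / (pdb^2 a2)]."""
    h1 = int(10_000 * z ** 2 * omega1 / (pdb_percent ** 2 * a2))
    return a2 * h1 - 1


# alpha = 0.05 = 1/20, so a2 = 20. Assumed conventions: pdb = 5 (percent), z = z_{0.975}.
z975 = STD_NORMAL.inv_cdf(0.975)
n1_ci = initial_replications(a2=20, pdb_percent=5.0, z=z975,
                             omega1=omega1_symmetric_ci(0.05))
print(n1_ci)  # 1379
```

The same function applies to the overidentifying restrictions test once $\omega_1$ is computed from the $\chi^2_{q-p}$ density in Table 8.1.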

We report results using both choices of first-step weighting matrix used in our
empirical example. For the case in which $W^{(1)}$ is the inverse of the instrument
cross product matrix, the first step estimation is performed in the bootstrap
sample using the corresponding matrix constructed from the bootstrap sample,
denoted here by $(T^{-1}\sum_{t=1}^{T} z_t^{*}z_t^{*\prime})^{-1}$. Given the nonlinearity of the model, it
is particularly attractive to use the approximate bootstrap to reduce the
computational burden. Unfortunately, to date, there are no theoretical results to
offer guidance on the number of steps needed to obtain asymptotic refinements
for the particular optimization routine used in fminu in MATLAB. So for
illustrative purposes, the method is performed using 2, 4, 6, 8, 10, 20, 30, 40, 50 and
100 steps. For this example, it turns out that the percentiles are sensitive
to the number of steps allowed. As an illustration, Table 8.3 reports the
appropriate percentiles based on $N = N_1$ replications for the case in which the
asset is VWR using both choices of first step weighting matrix and blocking
scheme. It should be noted that, for a given blocking scheme, the calculations
reported in Table 8.3 are all based on the same set of bootstrap samples
and so the only difference is in the number of steps in the approximate
bootstrap.

As a reminder, the corresponding percentiles from the limiting distributions
are 7.815 for the overidentifying restrictions test, and 1.96 for the t-statistics.
Inspection reveals that the bootstrap percentiles are for the most part close
to these limiting values. However, the percentiles are also clearly sensitive to
the blocking scheme, the number of steps in the approximate bootstrap, and
also the choice of first step weighting matrix. Of these three, the nature of
the Edgeworth expansion would lead us to anticipate the sensitivity of the
percentiles to $W^{(1)}$, but the rest are less easily explained. The sensitivity of the
bootstrap percentiles to the number of steps in the numerical optimization
suggests that the theoretical results outlined above do not extend to the routines
Table 8.3
Bootstrap 95th percentiles for the overidentifying restrictions test
and the absolute values of the t-statistics with VWR

                                              Non-overlapping blocks                              Overlapping blocks
$W^{(1)}$                         $i_{max}$   $J^*_T$   $\hat a^*_{T,\gamma}$   $\hat a^*_{T,\delta}$     $J^*_T$   $\hat a^*_{T,\gamma}$   $\hat a^*_{T,\delta}$

10° $I_5$                              2      8.311     0.000     1.161       8.071     0.000     1.087
                                       4      8.311     0.000     1.161       8.071     0.000     1.088
                                       6      8.208     1.586     1.593       7.664     1.518     1.521
                                       8      7.740     2.018     1.983       7.499     1.958     1.983
                                      10      7.794     1.995     1.990       7.438     1.937     1.966
                                      20      7.747     2.035     2.009       7.322     1.985     1.990
                                      30      7.706     2.030     2.018       7.263     2.002     2.006
                                      40      7.731     2.035     2.040       7.264     2.007     2.010
                                      50      7.731     2.049     2.027       7.266     2.007     2.012
                                     100      7.736     2.048     2.027       7.266     2.007     2.012

$(T^{-1}\sum_{t=1}^{T}z_tz_t')^{-1}$            2      8.238     0.000     1.156       7.564     0.000     1.100
                                       4      8.238     0.000     1.156       7.564     0.000     1.103
                                       6      8.114     1.496     1.529       7.546     1.573     1.621
                                       8      7.936     2.012     1.966       7.258     2.024     1.973
                                      10      7.822     1.964     1.960       7.214     2.022     1.971
                                   ≥ 20      7.814     1.966     1.960       7.251     2.022     1.973


Notes: $J^*_T$ is defined in (8.39), $\hat a^*_{T,\gamma}$ and $\hat a^*_{T,\delta}$ are defined in (8.49) with the i subscript
replaced by the symbol for the parameter in question. $i_{max}$ denotes the maximum number of
steps in the approximate bootstrap.


used here, at least for $i_{max} < 100$. A striking feature of this sensitivity is
that the percentile for $\hat a^*_{T,\gamma}$ is zero to five decimal places with $i_{max} = 2, 4$ with
either choice of weighting matrix or blocking scheme. We conjecture that this
reflects a feature of the moment condition noted in Section 3.6, namely that it
is nearly uninformative about $\gamma_0$. However, a deeper analysis is left to future
research.


The next step is to calculate $N_2$. For brevity, we focus on the case where
$i_{max} = 100$. Table 8.4 reports the values of $N_2$ and the final bootstrap
percentiles for both choices of asset. As can be seen, the bootstrap percentiles
for the overidentifying restrictions test do not alter our verdict about the
specifications. As with inference based on the asymptotic critical values, the model
is rejected for EWR but not with VWR. Given this evidence, it is only
interesting to consider the confidence intervals for the parameters based on VWR.
However, since the percentiles for the t-statistics for $\gamma$ and $\delta$ are so close to the
corresponding values from the standard normal distribution, this is left as an
exercise for the reader.
Table 8.4
$N_2$ and final bootstrap 95th percentiles for the overidentifying
restrictions test and confidence intervals for $\gamma_0$ and $\delta_0$


                       Non-overlapping blocks                                          Overlapping blocks
Asset   $W^{(1)}$   $N_2(J^*_T)$   $N_2(\hat a^*_{T,\gamma})$   $N_2(\hat a^*_{T,\delta})$     $N_2(J^*_T)$   $N_2(\hat a^*_{T,\gamma})$   $N_2(\hat a^*_{T,\delta})$

VWR     A        2999      1819      1819        6659      1399      1199
        B        4939      1159      1299        2199       899      1179
EWR     A        2799       939       939        3419       959      1359
        B        4079      1999      1919        2139      1499      1099

                       Non-overlapping blocks                                          Overlapping blocks
Asset   $W^{(1)}$   $J^*_T$   $\hat a^*_{T,\gamma}$   $\hat a^*_{T,\delta}$     $J^*_T$   $\hat a^*_{T,\gamma}$   $\hat a^*_{T,\delta}$

VWR     A        7.684     2.013     2.000       7.544     1.973     1.994
        B        7.566     2.045     2.020       7.251     2.022     1.973
EWR     A        7.701     NA        NA          7.667     NA        NA
        B        7.415     NA        NA          7.573     NA        NA


Notes: $N_2(.)$ denotes the value for $N_2$ calculated using the formulae associated with the
statistic in the parentheses. A denotes $W^{(1)}$ = 10° $I_5$, B denotes $W^{(1)} = (T^{-1}\sum_{t=1}^{T}z_tz_t')^{-1}$.
NA denotes “not applicable”. For other definitions see Table 8.3.


8.2 Inference in the Presence of Weak 
Identification 


The asymptotic theory in Chapters 3 and 5 is predicated on the assumption 
that the parameter vector is identified by the population moment condition 
used in the estimation. In recent years there has been a growing awareness 
that this proviso may not be so trivial in situations which arise in practice. 
In a very influential paper, Nelson and Startz (1990) drew attention to this
potential problem and provided the first evidence of the problems it causes for
the inference framework we have described above. Their paper has prompted
considerable interest in the behaviour of GMM in cases in which the parameter
vector is weakly identified. In this section we provide a review of this literature.

To begin, it is necessary to define what is meant by the term “weak
identification”. The essence of the concept is most easily understood within a simple
example. Accordingly, we consider the simple linear regression model


$y_t = x_t\theta_0 + u_t$ (8.59)


in which $u_t$ is an i.i.d. process with mean zero and variance $\sigma_u^2$. Suppose the
scalar parameter $\theta_0$ is estimated by Instrumental Variables which, as we have
seen, is just GMM estimation based on the population moment condition

$E[z_tu_t(\theta_0)] = 0$ (8.60)


where $z_t$ is a $q \times 1$ vector of instruments and $u_t(\theta_0) = y_t - x_t\theta_0$. From Section
2.1, it can be recalled that $\theta_0$ is identified by (8.60) if rank$\{E[z_tx_t]\} = 1$. In
this simple example, $\theta_0$ is unidentified if $E[z_tx_t]$ is the null vector, which would
occur if $z_t$ and $x_t$ are uncorrelated and both possess zero means. In practice, it
is unlikely that $E[z_tx_t]$ is exactly zero. The contribution of Nelson and Startz’s
(1990) paper is to demonstrate that problems occur if $E[z_tx_t]$ is non-zero but
small. It is this scenario which is referred to as “weak identification”. It is also
convenient to have a terminology that describes this scenario in terms of the
population moment condition. Therefore, if the parameter vector is weakly
identified then the population moment condition is said to be nearly uninformative
about the parameter vector.

To proceed, it is necessary to develop a model which can capture the idea
of nearly uninformative moment conditions. Staiger and Stock (1997) solve this
problem by assuming that

$x_t = z_t'\gamma_T + e_t$ (8.61)


where $\gamma_T = T^{-1/2}c$, c is a non-zero $q \times 1$ vector of constants, and $e_t$ is the
unobserved error which has a zero mean and is uncorrelated with $z_t$.$^{22}$
Using similar logic to the derivation of (2.3), it follows that this specification
implies

$E_T[z_tu_t(\theta)] = \{E[z_tz_t']\}T^{-1/2}c(\theta_0 - \theta)$ (8.62)

Therefore, $\theta_0$ is identified by (8.60) for finite T but is not in the limit as
$T \to \infty$.$^{23}$ So the concept of nearly uninformative moment conditions is
captured by assuming that the information in the population moment condition
disappears at rate $T^{-1/2}$. This rate is chosen so that the effects of the nearly
uninformative moment conditions manifest themselves in the limiting behaviour
of the estimator. Since p = 1, we have


$\hat\theta_T - \theta_0 = \dfrac{x'Z(Z'Z)^{-1}Z'u}{x'Z(Z'Z)^{-1}Z'x}$ (8.63)


in the obvious notation. As in Section 2.3, the limiting behaviour of $\hat\theta_T - \theta_0$
depends on the limiting behaviour of the components on the right hand side of
(8.63). Using the Weak Law of Large Numbers and the Central Limit Theorem
respectively, it follows that: (i) $T^{-1}Z'Z \stackrel{p}{\to} M_{zz}$, a positive definite matrix of
constants; (ii) $T^{-1/2}Z'u \stackrel{d}{\to} N(0,\sigma_u^2M_{zz})$, assuming here for simplicity that $z_t$


22 Notice this design involves exactly the same type of Pitman drift that is used to set
up local alternatives to hypothesis tests; see Section 5.1.3. Equation (8.61) implies the
explanatory variable is a triangular array $\{x_{t,T};\ t = 1,2,\ldots,T;\ T = 1,2,\ldots\}$ but we suppress the
second subscript for notational brevity. This structure also implies that the distribution of $x_t$
is indexed by T and so we index the expectation operator by T when it is applied to functions
of $x_t$.

23 See Section 2.1 for a discussion of identification in the linear model. 
is independent of $u_t$. Notice that neither (i) nor (ii) involves the relationship
between $x_t$ and $z_t$ and so both would equally hold if $\theta_0$ is properly identified. The
key difference comes in the behaviour of $Z'x$. From (8.61), it follows that


$Z'x = T^{-1/2}Z'Zc + Z'e$ (8.64)


where e is the $T \times 1$ vector with $t$th element $e_t$. Therefore, $T^{-1}Z'x \stackrel{p}{\to} 0$ and
$T^{-1/2}Z'x \stackrel{d}{\to} N(M_{zz}c, \sigma_e^2M_{zz})$. The nature of this limiting behaviour means
that,


$\hat\theta_T - \theta_0 = \dfrac{T^{-1/2}x'Z\,(T^{-1}Z'Z)^{-1}\,T^{-1/2}Z'u}{T^{-1/2}x'Z\,(T^{-1}Z'Z)^{-1}\,T^{-1/2}Z'x} \stackrel{d}{\to} \dfrac{\Psi_1'M_{zz}^{-1}\Psi_2}{\Psi_1'M_{zz}^{-1}\Psi_1}$ (8.65)


where $\Psi_1 \sim N(M_{zz}c, \sigma_e^2M_{zz})$ and $\Psi_2 \sim N(0, \sigma_u^2M_{zz})$. Therefore, $\hat\theta_T$ converges
to a random variable when the parameter vector is weakly identified in the sense of
(8.61). This is in marked contrast to the case when $\theta_0$ is identified because then
$\hat\theta_T$ converges in probability to $\theta_0$.$^{24}$
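
The contrast between the two cases can be seen in a small simulation. The following Python sketch is an illustration under assumed settings, not a calculation from the literature cited here: there is a single instrument, $z_t$ and $e_t$ are standard normal, $u_t$ has correlation 0.8 with $e_t$, c = 1, and the function names are hypothetical. With q = 1 the estimation error is $\sum_t z_tu_t/\sum_t z_tx_t$, so the code records its interquartile range at two sample sizes, once with the first-stage coefficient set to $T^{-1/2}c$ as in (8.61) and once with a fixed coefficient.

```python
import random
import statistics


def iv_errors(T, weak, reps, rng, c=1.0, rho=0.8):
    """Simulate theta_hat - theta_0 for the one-instrument IV estimator."""
    coef = c / T ** 0.5 if weak else c  # first-stage coefficient; T^{-1/2} c when weak
    errors = []
    for _ in range(reps):
        szx = szu = 0.0
        for _ in range(T):
            z = rng.gauss(0.0, 1.0)
            e = rng.gauss(0.0, 1.0)
            # u is correlated with e (endogeneity) but independent of z
            u = rho * e + (1 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
            x = coef * z + e
            szx += z * x
            szu += z * u
        errors.append(szu / szx)
    return errors


def iqr(values):
    """Interquartile range of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


rng = random.Random(12345)
spread = {(weak, T): iqr(iv_errors(T, weak, 200, rng))
          for weak in (True, False) for T in (100, 1600)}
weak_ratio = spread[(True, 1600)] / spread[(True, 100)]
strong_ratio = spread[(False, 1600)] / spread[(False, 100)]
print(weak_ratio, strong_ratio)
```

Under the weak design the ratio of interquartile ranges should stay near one, because the estimator converges to a random variable, whereas under the fixed-coefficient design it should be close to $(100/1600)^{1/2} = 0.25$, in line with the usual $T^{1/2}$ rate.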

This simple example provides a clear indication that the asymptotic theory
derived in Chapters 3 and 5 is inappropriate for the weak identification case. As
a result, three questions naturally arise: (i) what is the behaviour of the GMM
estimator in dynamic nonlinear models when the parameter vector is weakly
identified? (ii) is it possible to perform inference about $\theta_0$ in this setting and
if so how? (iii) is it possible to test whether $\theta_0$ is identified by the population
moment condition? These three questions are covered respectively in Sections
8.2.1 through 8.2.3.

Before we proceed to this discussion, it is worth emphasising the intended
interpretation of this framework. The definition of weak identification is artificial
in the sense that it is not seriously believed that real economic data are
generated by processes with Pitman drift. This is simply a mathematical device that
is used to generate a limiting distribution theory that provides a good
approximation to finite sample behaviour in cases when, in the terms of our simple
example, $E[x_tz_t]$ is small but non-zero. However, it should be noted that if
$E[x_tz_t]$ is non-zero then the asymptotic theory in Chapter 2 (or more generally
Chapters 3 and 5) is valid. The problem is that it may take a very large T
before this asymptotic theory provides a good approximation. Hahn and Inoue
(2002) provide simulation evidence that suggests that conventional asymptotics
can provide a satisfactory approximation in the types of large dataset
encountered in microeconometrics (i.e. T = 10,000) unless the number of instruments
is large and the correlation between the endogenous regressor and the
instruments is pathologically small.$^{25}$ However, available evidence suggests that this


24 See Section 2.3. 
25 Hahn and Inoue (2002) compare a number of methods for constructing confidence in- 
is not the case in the sample sizes encountered in macroeconometrics and that
in these cases the theory discussed below provides a better approximation.

It is also worth noting that the approach described above is not the only 
possible way to obtain an alternative distribution theory for the IV estimator 
in the presence of weak identification. One alternative is to use the limiting 
distribution theory developed by Bekker (1994) that is based on the assumption 
that the number of instruments increases with the sample size.$^{26}$ Hahn and
Inoue (2002) find that this distribution theory provides a good approximation 
in the types of sample sizes encountered in microeconometrics. However, this 
approach has not yet been extended to nonlinear models and so we do not pursue 
it further here. 


8.2.1 The Limiting Behaviour of the GMM Estimator 


To date, weak identification has been mostly encountered in models estimated
by Generalized Instrumental Variables estimators, and so our discussion focuses
on this case.$^{27}$ In this setting, the problem of weak identification is commonly
referred to as the “weak instrument” problem. Staiger and Stock (1997) develop
the limiting distribution theory in linear models, and Stock and Wright (2000)
extend the analysis to nonlinear models. It is the latter that is our focus here.
One central finding of these papers is that the usual limiting distribution
theory does not apply and this motivates the presentation of alternative inference
methods in the following sub-section. In view of this, our discussion
concentrates on the framework for capturing nearly uninformative moment conditions
and the conclusions to be drawn from the nature of the limiting distributions.
The interested reader is referred to Stock and Wright (2000) for detailed
derivations. Although the analysis is in terms of GIV estimation, it is worth noting
that the corresponding results for GMM can be obtained by setting $z_t = 1$.

As mentioned above, we focus on the following class of moment conditions. 


Assumption 8.4 GIV Estimation
Let $f(v_t,\theta) = u_t(\theta) \otimes z_t$.


In our simple example above, the parameter vector consists of a single element
and so, by construction, the entire parameter vector is weakly identified. In
more general settings, logic dictates that some elements of the parameter vector
may be identified and others weakly identified. To accommodate this scenario,
we partition the parameter vector as follows: $\theta = (\phi',\psi')'$ where $\phi$ is $p_\phi \times 1$
and $\psi$ is $p_\psi \times 1$ where $p = p_\phi + p_\psi$. Similarly, we write $\Theta = \Phi \times \Psi$ in the
obvious notation. Below $\phi$ consists of the weakly identified parameters and
$\psi$ the parameters that are identified. As before, it is assumed that $q > p$
and so problems with identification are not due to too few population moment

tervals in the context of a simple linear regression model. Also see Section 6.2.1 for further 
discussion of the connection between identification and the passage to the limiting distribution. 
26 See Section 6.1.3. 
27 See Section 7.2. 
conditions per se but rather the poor quality of the information in these moment
conditions.$^{28}$

It is pedagogically easier to present the mathematical framework used to 
capture this scenario and then discuss how it achieves the desired goal. This 
framework is given in the following assumption. 


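Assumption 8.5 Weak Identification
(i) $E_T[f(v_t,\theta)] = T^{-1/2}m_{1,T}(\theta) + m_2(\psi)$.


(ii) $m_{1,T}(\theta) \to m_1(\theta)$ uniformly in $\theta \in \Theta$, $m_1(\theta_0) = 0$, $m_1(\theta)$ is continuous in
$\theta$ and is bounded on $\Theta$.


(iii) $m_2(\psi_0) = 0$, $m_2(\psi) \neq 0$ for $\psi \neq \psi_0$, $M_2(\psi) = \partial m_2(\psi)/\partial\psi'$ has full rank at
$\psi = \psi_0$ and is continuous.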


From Assumption 8.5, it can be seen that the population moment condition is
satisfied at $\theta_0$. At other parameter values, $E_T[f(v_t,\theta)]$ consists of two parts.
The first part, $T^{-1/2}m_{1,T}(\theta)$, decays to zero at rate $T^{-1/2}$. Therefore, this first
part is nearly uninformative about both $\phi_0$ and $\psi_0$. In contrast, the second part,
$m_2(\psi)$, is non-zero for any $\psi \neq \psi_0$ and so is informative about $\psi_0$. Therefore,
within this framework, $\phi_0$ is weakly identified but $\psi_0$ is identified.

Before proceeding further, it is worth briefly contrasting this framework
with the scenario of redundant moment conditions described in Section 6.1.2.
It can be recalled that $E[f_2(v_t,\theta_0)] = 0$ is redundant given $E[f_1(v_t,\theta_0)] = 0$
if the asymptotic variance of the GMM estimator is the same whether
estimation is based on $E[f_1(v_t,\theta_0)] = 0$ alone or on both $E[f_1(v_t,\theta_0)] = 0$ and
$E[f_2(v_t,\theta_0)] = 0$. The presumption in this earlier discussion is that $\theta_0$ is
identified by $E[f_1(v_t,\theta_0)] = 0$. Therefore the literature on redundant moment
conditions addresses the problems encountered if uninformative moment conditions
are included along with informative moment conditions when the parameter
vector is identified.$^{29}$ In contrast, the literature on weak identification addresses
the problems that arise when the population moment conditions are collectively
nearly uninformative about the parameter vector. As might be imagined, the
consequences of redundant and nearly uninformative moment conditions are
quite different.$^{30}$

Stock and Wright (2000) [Corollary 4] present the limiting distributions of
the first step, second step and continuous updating GIV estimators. For brevity,
we focus on the second step estimators, $\hat\phi_T(2)$ and $\hat\psi_T(2)$. However, it emerges
that these distributions depend in part on the behaviour of the first step
estimators, $\hat\phi_T(1)$ and $\hat\psi_T(1)$, and so the relevant aspects of the limiting behaviour
of the first step estimators are also summarized below. The analysis rests on the


28 See Section 3.1 for a discussion of identification. 

29 See Sections 6.1.2, 7.3.2 and 7.3.4. 

30 See Hall, Inoue, Jana, and Shin (2003) for further discussion of the connections between 
redundancy and weak identification. 
empirical process representation for the GMM objective function.$^{31}$ Within this
framework, $T^{1/2}g_T(\theta)$ is treated as a function of $\theta$ that converges to a Gaussian
process.$^{32}$ The limit process is assumed to have the following properties.$^{33}$


Assumption 8.6 Functional Central Limit Theorem
$T^{1/2}g_T(\theta) \Rightarrow \Psi(\theta)$ where $\Psi(\theta)$ is a Gaussian stochastic process on $\Theta$ with mean
zero and covariance function $E[\Psi(\theta_1)\Psi(\theta_2)'] = \Omega(\theta_1,\theta_2)$.


The following lemma characterizes the limiting behaviour of the two step 
GIV estimators where, for simplicity, it is assumed that the first step weighting 
matrix is Ig. 

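Lemma 8.1 Limiting Behaviour of GIV Estimators
Under Assumptions 8.4-8.6 and certain other regularity conditions,$^{34}$ the
limiting behaviour of the GIV estimators is as follows: (i) $\hat\phi_T(1) \stackrel{d}{\to} \phi^{(1)}$,
$\hat\psi_T(1) \stackrel{p}{\to} \psi_0$; (ii)

$\begin{bmatrix} \hat\phi_T(2) \\ T^{1/2}[\hat\psi_T(2) - \psi_0] \end{bmatrix} \stackrel{d}{\to} \begin{bmatrix} \phi^{(2)} \\ \Lambda_\psi \end{bmatrix}$

where


$\phi^{(1)} = \mathrm{argmin}_{\phi\in\Phi}\, Q^{(1)}(\phi)$

$\phi^{(2)} = \mathrm{argmin}_{\phi\in\Phi}\, Q^{(2)}(\phi)$

$\Lambda_\psi = -[F(\phi^{(1)},\psi_0)'F(\phi^{(1)},\psi_0)]^{-1}F(\phi^{(1)},\psi_0)'\,\Omega(\theta^{(1)})^{-1/2}\,[\Psi(\phi^{(2)},\psi_0) + m_1(\phi^{(2)},\psi_0)]$

and

$Q^{(1)}(\phi) = [\Psi(\phi,\psi_0) + m_1(\phi,\psi_0)]'\,C_1(\psi_0)\,[\Psi(\phi,\psi_0) + m_1(\phi,\psi_0)]$
$C_1(\psi_0) = I_q - M_2(\psi_0)[M_2(\psi_0)'M_2(\psi_0)]^{-1}M_2(\psi_0)'$
$Q^{(2)}(\phi) = [\Psi(\phi,\psi_0) + m_1(\phi,\psi_0)]'\,C_2(\phi^{(1)},\psi_0)\,[\Psi(\phi,\psi_0) + m_1(\phi,\psi_0)]$
$C_2(\phi^{(1)},\psi_0) = \Omega(\theta^{(1)})^{-1/2\prime}\,K(\phi^{(1)},\psi_0)\,\Omega(\theta^{(1)})^{-1/2}$
$K(\phi^{(1)},\psi_0) = I_q - F(\phi^{(1)},\psi_0)[F(\phi^{(1)},\psi_0)'F(\phi^{(1)},\psi_0)]^{-1}F(\phi^{(1)},\psi_0)'$
$F(\phi^{(1)},\psi_0) = \Omega(\theta^{(1)})^{-1/2}M_2(\psi_0)$
$\Omega(\theta^{(1)}) = \Omega(\theta^{(1)},\theta^{(1)})$
$\theta^{(1)} = (\phi^{(1)\prime},\psi_0')'$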

31 See Andrews (1994) for a review of empirical process theory. 

32 A similar device is used to develop a limiting distribution theory for structural stability 
tests in Section 5.4.2.1. Note that in the context of structural stability tests, the partial sum 
is treated as a function of the break fraction 7. 

33 See Andrews (1994) or Stock and Wright (2000) for more primitive conditions under 
which the Functional Central Limit Theorem holds in this setting. 

34 These include an identification condition that is omitted here to simplify the presentation. 
and $M_2(\psi)$ is defined in Assumption 8.5. Lemma 8.1 has two important
implications. First, the estimator of the sub-vector of the weakly identified parameters,
$\hat\phi_T$, converges to a random variable on both steps and so is not consistent.
Secondly, the estimator of the sub-vector of identified parameters, $\hat\psi_T$, is consistent
but its limiting distribution is no longer normal.$^{35}$ In other words, inference
about the identified parameters is contaminated by the presence of the weakly
identified parameters. The bottom line is that none of the inference procedures
in Chapter 5 are valid in the presence of weak identification. The following
sub-section considers alternative approaches to inference that have been proposed
to circumvent these problems.


8.2.2 Inference in the Presence of Weak Identification 


Since the GMM estimators exhibit non-standard limiting behaviour, the
conventional estimator based approach to inference is infeasible. To resurrect
inference in this setting, an alternative approach is required. This approach involves
finding a statistic whose limiting distribution at $\theta_0$ is both standard and also
unaffected by the presence of weak identification, and then inverting this statistic
to construct confidence sets for $\theta_0$. In the context of linear models estimated by
IV, Staiger and Stock (1997) explore this approach based on the Anderson and 
Rubin (1949) statistic. In the same context, Wang and Zivot (1998) show that 
modified versions of the Wald, LM and D statistics are bounded by a statistic 
of known distribution and so can also form the basis for confidence sets. These 
approaches are compared in Zivot, Startz, and Nelson (1998). We do not re- 
view these papers in more detail as the results only apply to the linear setting. 
Instead, we focus on the method proposed by Stock and Wright (2000) that is 
valid in nonlinear models. 

As noted above, it is necessary to find a statistic whose limiting distribution
at $\theta_0$ is both standard and also unaffected by the presence of weak identification.
Fortunately, such a statistic is close at hand. Under the conditions of Lemma
3.2,$^{36}$ it follows that


$TQ_{cont,T}(\theta_0) = Tg_T(\theta_0)'S_T(\theta_0)^{-1}g_T(\theta_0) \stackrel{d}{\to} \chi^2_q$ (8.66)


Therefore, Stock and Wright (2000) propose inverting $TQ_{cont,T}(\theta_0)$ to obtain
the approximate $100(1-\alpha)\%$ confidence set


$\{\theta : TQ_{cont,T}(\theta) < c_q(\alpha)\}$ (8.67)


where $c_q(\alpha)$ is the $100(1-\alpha)$th percentile of the $\chi^2_q$ distribution. This approach to
inference is discussed in Section 3.7 where it is proposed as a way of constructing
confidence sets that are invariant to reparameterization. In the context of weak
identification, such sets are often referred to as “S-sets”, a terminology derived


35 The distribution of $T^{1/2}[\hat\psi_T(1) - \psi_0]$ is qualitatively similar to that of
$T^{1/2}[\hat\psi_T(2) - \psi_0]$; see Stock and Wright (2000).
36 See Section 3.4.2. 
from the notation in Stock and Wright (2000). However, since our notation is
different, we eschew this name. Notice that the distributional result in (8.66)
holds regardless of whether $\theta_0$ is identified or weakly identified, and so
this approach to inference is equally valid in both cases.$^{37}$

The confidence sets in (8.67) have the appealing feature that they can be
infinite in one or more dimensions and so reveal that the elements of the
parameter vector in question are unidentified. This contrasts with the conventional
asymptotic confidence intervals in (3.27) which are by construction of finite
length. In fact, the intervals in (3.27) are fundamentally flawed in this setting.
Dufour (1997) shows that for a confidence interval to have the stated coverage
probability it must be possible for it to have infinite length.$^{38}$

However, this approach to inference also has its drawbacks. Kleibergen
(2000) has pointed out that the confidence sets in (8.67) are not centred on $\hat\theta_T$.
Whether this is perceived as a drawback may be a matter of taste. Kleibergen
(2000) uses this feature to motivate confidence sets based on inversion of a
statistic based on the first order derivative of the continuous updating GMM
minimand; see Kleibergen (2000) for further details. A more serious drawback is
that the inversion in (8.67) is only computationally feasible for relatively small
values of p. This problem can be ameliorated if the partition between the weakly
identified and identified parameters is known and it is the weakly identified
parameters that are of interest. In such circumstances, valid confidence sets
can be based on the minimand of the restricted GMM estimation in which the
minimization is performed over the identified parameter, $\psi$, conditional on a
value for the weakly identified parameter, $\phi$. To flesh out the details, it is
necessary to introduce some additional notation. Let $\hat\psi_T(\bar\phi)$ denote the GMM
estimator of $\psi$ conditional on $\phi = \bar\phi$, that is


$\hat\psi_T(\bar\phi) = \mathrm{argmin}_{\psi\in\Psi}\ TQ_{cont,T}(\theta)|_{\phi=\bar\phi}$ (8.68)


Stock and Wright (2000) show that


$TQ_{cont,T}(\phi_0, \hat\psi_T(\phi_0)) \stackrel{d}{\to} \chi^2_{q-p_\psi}$ (8.69)


where, as before, $p_\psi$ is the dimension of $\psi$, and so propose the following
asymptotically valid $100(1-\alpha)\%$ confidence set for $\phi_0$,


$\{\phi : TQ_{cont,T}(\phi, \hat\psi_T(\phi)) < c_{q-p_\psi}(\alpha)\}$ (8.70)


We now illustrate this alternative approach within the context of our running 
empirical example. 


37 See Section 3.7 for a discussion of other reasons for using this approach to inference when 
the parameter vector is identified. 

38 The intervals in (3.27) may also be invalid if $\theta_0$ is identified but there is a subset of the
parameter space, $\Theta_{un}$, in which $\theta$ is unidentified; see Dufour (1997) for further discussion.
Example: Hansen and Singleton’s (1982) Consumption Based Asset
Pricing Model

The confidence sets in (8.67) have already been presented in Section 3.7. As a
reminder, the set for $(\gamma_0, \delta_0)$ was non-empty but bounded for the value weighted
returns (VWR) case, and empty for equally weighted returns (EWR). The latter
indicates misspecification, and it is consistent with our findings based on the
overidentifying restrictions test in Section 5.1. Therefore, we concentrate on the
VWR case here. The available evidence suggests that this is one case in which
the partition of the parameter vector might reasonably be taken to be known
with the weakly identified parameter being $\phi = \gamma$ and the identified parameter,
$\psi = \delta$. Using this partition, we use (8.70) to calculate a confidence set (or
interval, as $p_\phi = 1$) for $\gamma$. To make the calculation feasible in practice, it is
necessary to discretize the parameter space for $\gamma$. This was done as follows.
To begin, the parameter space is taken to consist of all points lying between
-50 and 50 on a grid with 0.01 between each point. This leads to an interval of
[-4.88, 7.19]. To refine this interval, the calculations are redone for the finer grid
consisting of all points between -5.000 and 7.200 with 0.001 between each point.
The resulting interval is [-4.886, 7.188]. This interval is almost twice the width
of the interval reported in Table 3.8 that is based on the traditional asymptotic
confidence interval given in (3.27). It is also asymmetric around $\hat\gamma_T$, which is
0.666 for this model.$^{39}$ Therefore, the use of these alternative asymptotics leads
to different conclusions about the set of plausible values for $\gamma_0$. ◇
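
The two-stage grid search described in the example can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code behind the reported interval: the function names are hypothetical, and a toy quadratic statistic stands in for the concentrated minimand in (8.70), whose evaluation at each grid point would in practice require a minimization over the identified parameter. The critical value 3.841 is the 95th percentile of the chi-squared distribution with one degree of freedom and is used only for the toy statistic.

```python
def invert_on_grid(stat, crit, lo, hi, step):
    """Return (min, max) of the grid points at which stat(point) < crit, or None if empty.

    The accepted set need not be an interval; only its end points are reported here.
    """
    n = round((hi - lo) / step)
    accepted = [lo + i * step for i in range(n + 1) if stat(lo + i * step) < crit]
    return (min(accepted), max(accepted)) if accepted else None


def toy_stat(gamma):
    """Hypothetical stand-in for the statistic being inverted."""
    return ((gamma - 1.0) / 2.0) ** 2


crit = 3.841
# Stage 1: coarse grid between -50 and 50 with 0.01 between each point.
coarse = invert_on_grid(toy_stat, crit, -50.0, 50.0, 0.01)
# Stage 2: finer grid (0.001 between points) just covering the coarse interval.
fine = invert_on_grid(toy_stat, crit, coarse[0] - 0.01, coarse[1] + 0.01, 0.001)
print([round(v, 3) for v in fine])  # [-2.919, 4.919]
```

A return value of None corresponds to an empty confidence set, the outcome described above for the EWR case.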


8.2.3 The Detection of Weak Identification 


It is clear from the discussion in Section 8.2.1 that the presence of weak iden- 
tification renders the conventional asymptotics inappropriate. This motivates 
the development of the alternative approach to inference described in Section 
8.2.2 based on methods that are robust to the presence of nearly uninformative 
moment conditions. The difference between these two inference frameworks nat- 
urally raises the question of how a researcher should perform inference if the 
identification of the parameters is suspect. One solution is to adopt the confi- 
dence set framework in Section 8.2.2 because it is valid regardless of whether 
or not $\theta_0$ is identified. However, there are at least three reasons why it may be
desirable to diagnose whether or not the parameters are identified. First, the 
confidence set may only be feasible in cases where p is relatively small. Secondly,
there is a far wider array of inference procedures available if the parameter
vector is identified. Finally, if the parameter vector is identified then the point 
estimator is consistent and this knowledge may affect our interpretation of the 
estimates. Therefore, in this section, we consider methods that have been pro- 
posed for testing identification. Even more than other aspects of the literature 
on weak identification, this topic has been addressed within the context of lin- 
ear regression models estimated by IV. In spite of this limitation, we review the 
available results because the qualitative conclusions likely extend to nonlinear
models. However, it is left to future research to extend the methods described
here to the GMM framework with nonlinear dynamic models.

39 See Table 3.7

To initiate the discussion, we first consider how the presence of weak
identification might be detected within the simple motivating example given at
the beginning of this sub-section. It can be recalled that the population
moment condition in (8.60) is nearly uninformative about $\theta_0$ because $E_T[x_t | z_t]$ is
decaying to zero at rate $T^{-1/2}$. Or put more simply, the relationship between
$x_t$ and $z_t$ is dying out as $T \to \infty$. Intuition suggests that this state of affairs
can be uncovered by running the regression of $x_t$ on $z_t$ and examining standard
diagnostics for goodness of fit. Since it is desired to develop a test for
identification, the most convenient diagnostic is the F-statistic for the hypothesis
that the coefficients on $z_t$ are all zero in this regression. Bound, Jaeger, and
Baker (1995) advocate that this first stage F-statistic be routinely reported as
a “rough guide” on the strength of the identification. For our purposes here, it is
useful to formalize this recommendation. To this end, we denote the regression
model for $x_t$ on $z_t$ by40

$$x_t = z_t'\gamma + \text{error}$$

Let $F_\gamma$ be the F-statistic for the hypothesis that $\gamma = 0$ in this model. In terms
of identification, the null and alternative hypotheses are interpreted as follows:

$$H_0: \gamma = 0 \;\Rightarrow\; \theta_0 \text{ not identified} \tag{8.71}$$
$$H_1: \gamma \neq 0 \;\Rightarrow\; \theta_0 \text{ identified} \tag{8.72}$$

Therefore, $\theta_0$ is deemed identified if $F_\gamma$ is significant.
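To make the first stage diagnostic concrete, the following is a minimal sketch of the F-statistic for the hypothesis that all coefficients on the instruments are zero. It assumes a regression without an intercept, as in the displayed model, and uses simulated data; the function name `first_stage_f` and the two data designs are illustrative assumptions rather than anything taken from the studies cited above.

```python
import numpy as np


def first_stage_f(x, z):
    """F-statistic for H0: gamma = 0 in x_t = z_t' gamma + error (no intercept)."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    T, k = z.shape
    gamma_hat, *_ = np.linalg.lstsq(z, x, rcond=None)
    fitted = z @ gamma_hat
    resid = x - fitted
    explained_ss = fitted @ fitted       # equals x'x minus the residual sum of squares
    s2 = resid @ resid / (T - k)         # residual variance estimate
    return (explained_ss / k) / s2


rng = np.random.default_rng(0)
T = 500
z = rng.standard_normal((T, 2))

# Strong instruments: fixed, non-zero coefficients.
x_strong = z @ np.array([1.0, 1.0]) + rng.standard_normal(T)
# Weak instruments: coefficients shrink at rate T^{-1/2}.
x_weak = z @ (np.array([1.0, 1.0]) / np.sqrt(T)) + rng.standard_normal(T)

f_strong = first_stage_f(x_strong, z)
f_weak = first_stage_f(x_weak, z)
print(f_strong, f_weak)
```

In the weak design the statistic stays of moderate size however large T becomes, which is the behaviour the diagnostic is meant to flag.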

This generic approach can be extended to more general linear models in
which $x_t$ is a vector and includes both endogenous and exogenous regressors.
Hall, Rudebusch, and Wilcox (1996) propose testing for identification based on
the canonical correlations between $x_t$ and $z_t$. Shea (1997) proposes a method
based on the partial correlations between $x_t$ and $z_t$. Cragg and Donald (1993)
propose a test based on the rank of the coefficient matrix in the reduced form
regression of $x_t$ on $z_t$. However, we do not consider the specifics of these tests
further but instead focus on two aspects of this generic approach, namely the
implications of testing for identification prior to inference about $\theta_0$ and the
interpretation of the alternative hypothesis.

The ultimate focus of the analysis is $\theta_0$, and so it is important to consider
how the use of such a test for identification affects subsequent inferences about
the parameter vector. The answer is that it depends both on how the test
for identification is used and also on the statistic used to perform inference
about $\theta_0$. There are two ways in which the test for identification could be
used and, for simplicity, we discuss these in the context of the simple example
above. One option is to use $F_\gamma$ to select the instrument vector. Within this
approach, $F_\gamma$ is calculated for a sequence of possible choices of instrument and
the selected instrument vector, $\tilde z_t$ say, is the first in the sequence for which $F_\gamma$
is significant. Estimation is then based on the moment condition in (8.60) with
$z_t = \tilde z_t$ and inference is performed under the assumption that $\theta_0$ is identified.
A second option is to treat $z_t$ as given and then use $F_\gamma$ to determine which
statistical theory is used to perform inference about $\theta_0$. With this approach,
an insignificant $F_\gamma$ value leads to inference based on a method that is valid
if $\theta_0$ is weakly identified, and a significant $F_\gamma$ leads to inference based on a
method that is valid if $\theta_0$ is identified. The evidence to date suggests that the
first option is not a good strategy but the second is. Hall, Rudebusch, and
Wilcox (1996) report simulation evidence on a variant of the first option above
in which a test for identification is used to select the instrument vector and
subsequent inference about the (scalar) parameter $\theta_0$ is performed using the
confidence interval in (3.27).41 This evidence indicates that inferences about $\theta_0$
can be severely distorted in the sense that the actual coverage probability of
the confidence interval is much smaller than the nominal level. However, there
is a caveat to this finding. The confidence interval in (3.27) can be interpreted
as containing all values of $\theta$ for which $H_0: \theta_0 = \theta$ is not rejected using the
Wald statistic at the $100\alpha\%$ level. Zivot, Startz, and Nelson (1998) show that
the behaviour of the Wald statistic is severely distorted by the presence of
weak identification whereas the LM and LR tests are far more robust. Zivot,
Startz, and Nelson (1998) report comparable evidence for the case in which the
confidence interval is calculated by inverting the LR and LM tests for $H_0: \theta_0 = \theta$.
This evidence indicates that the use of the Wald test based confidence interval
does indeed account for a substantial part of the distortions reported by Hall,
Rudebusch, and Wilcox (1996). However, non-trivial distortions remain even
if inference is based on the LR or LM tests. Zivot, Startz, and Nelson (1998)
also report simulation evidence for the second option in which the choice of
instrument is taken as given and $F_\gamma$ is used to determine the statistical theory
employed. Within their design, $\theta_0$ is a scalar and the confidence interval is
constructed by inverting either the LM or LR test for $H_0: \theta_0 = \theta$. The
value of $F_\gamma$ determines the distribution used to approximate the behaviour of
these statistics. Their evidence indicates that the coverage rate is very close to
the nominal level regardless of whether $\theta_0$ is unidentified, weakly identified or
identified.

40 In this context, this regression is often referred to as “the first stage regression” as it is
the first stage of a Two Stage Least Squares estimation of (8.59).

For all the tests of identification, the null is that $\theta_0$ is unidentified and the
alternative is that $\theta_0$ is identified. While it is true that failure to reject the null
indicates a problem, Stock and Yogo (2001) argue that rejection of the null at
conventional significance levels does not necessarily imply that “conventional
asymptotics” provide a good approximation. This is certainly true for the Wald
statistic considered in their paper, and likely true for other statistics as well.
They further argue that the definition of what constitutes weak identification
should reflect the nature of the desired inference about $\theta_0$. In the context of
IV estimation of a linear model, they consider two criteria for whether the
instruments are weak: one based on a measure of the bias in $\hat\theta_T$ and the other
based on the size distortions exhibited by the Wald test. Both are intuitively
reasonable but interestingly they yield different criteria that, furthermore, are
sensitive to different aspects of the specification. While this analysis is confined
to linear models, it clearly reveals that the issues of defining poor identification
and testing for identification are more subtle than has been previously realized.

To date, there has been far less attention on the issue of testing for
identification in nonlinear models. The simple reason is that in the general nonlinear
model the key determinant of local identification is the derivative matrix $G_0$,
which depends on $\theta_0$. This problem can be circumvented as demonstrated by
Wright (2001), who proposes a test for identification in the context of GIV
estimation. However, the issues described in the previous two paragraphs remain
to be addressed in this context, and so we do not explore the test further here.

41 The variation is that the test for identification is not $F_\gamma$ but a test based on the correlation
between scalar $x_t$ and scalar $z_t$.


8.3 Inference When the Long Run Variance is
Estimated by an HAC Estimator with
$b_T = T$


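It can be recalled from Theorem 3.2 that the asymptotic variance of $\hat\theta_T$ depends
on the long run variance of the sample moment, $S$. Given a consistent estimator
of $S$, $\hat S_T$ say, it is possible to perform inference about $\theta_0$ based on this limiting
distribution theory. For example, Theorem 3.2 implies that inference about $\theta_{0,i}$
can be based on

$$\frac{T^{1/2}(\hat\theta_{T,i} - \theta_{0,i})}{\sqrt{\hat V_{T,ii}}} \xrightarrow{d} N(0,1) \tag{8.73}$$

where $\hat V_{T,ii}$ is the $i$-$i$th element of

$$\hat V_T = [G_T(\hat\theta_T)'W_TG_T(\hat\theta_T)]^{-1} G_T(\hat\theta_T)'W_T\hat S_TW_TG_T(\hat\theta_T) [G_T(\hat\theta_T)'W_TG_T(\hat\theta_T)]^{-1} \tag{8.74}$$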
Section 3.5 contains a review of a number of alternative choices for $\hat S_T$, with
the choice between them depending upon the requisite assumptions about the
dependence structure of $\{f(v_t, \theta_0)\}$. The most general of these is the class of
heteroscedasticity autocorrelation covariance (HAC) estimators. To calculate
the HAC estimator, it is necessary to choose a kernel, $\omega(\cdot)$, and a bandwidth,
$b_T$. These components must satisfy certain restrictions if the resulting
covariance matrix estimator is to be consistent. However, these conditions leave a fair
amount of latitude. While the literature has provided guidance on the relative
merits of popular choices of kernel, there is no data based method for bandwidth
selection that does not involve some kind of arbitrary decision by the
practitioner. This is particularly undesirable as simulation evidence suggests that
subsequent inferences about $\theta_0$ can be sensitive to the choice of bandwidth.42
In view of these problems, Vogelsang (2003) proposes using an HAC estimator
with $b_T = T$. Such a rule has the twin advantages of being simple and definitive
but it violates the conditions for consistency of the covariance matrix estimator
with the result that (8.73) no longer holds. However, Vogelsang (2003) shows
that it is possible to develop an alternative asymptotic theory that can be used as
a basis for inference about $\theta_0$ in this case. In this section we briefly review this
theory.

42 See Sections 3.5.3 and 6.3.

To facilitate the discussion, it is useful to introduce the following notation.
In view of the structure of the kernels in Table 3.3 and our current focus on
the case in which $b_T = T$, we write $\omega(i/T)$ for $\omega_{i,T}$. Below it is necessary to
consider situations in which the argument of the kernel is a difference, and to
avoid excessive notation we set $k_{i,j} = \omega((i-j)/T)$. Let $\hat S_{b_T=T}$ denote the HAC
estimator in equation (3.54) with $b_T = T$, and set $g_t(\theta) = T^{-1}\sum_{i=1}^{t} f(v_i, \theta)$.
Finally, let $\hat\theta_T$ be the GMM estimator based on weighting matrix $W_T$. It is
important for the arguments below that $\hat\theta_T$ is consistent for $\theta_0$. One implication
of the analysis below is that $\hat S_{b_T=T}^{-1}$ does not satisfy Assumption 3.7 and so is not
a valid weighting matrix. Therefore, $\hat S_{b_T=T}$ is only used to perform inference
after estimation is completed.
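As an illustration of what setting the bandwidth equal to the sample size involves, the sketch below computes a kernel-weighted long run variance estimate with $b_T = T$. It assumes the estimator in (3.54) takes the standard form $T^{-1}\sum_i\sum_j k_{i,j} f_i f_j'$, which is not restated in this section, and the function names are hypothetical. For the Bartlett kernel, and only when the series sums exactly to zero, the same quantity can be written in terms of partial sums, and the code checks this numerically on simulated data.

```python
import numpy as np


def bartlett(x):
    """Bartlett kernel: 1 - |x| on (-1, 1) and zero elsewhere."""
    return np.maximum(1.0 - np.abs(x), 0.0)


def hac_bandwidth_T(f, kernel=bartlett):
    """T^{-1} sum_i sum_j k_{i,j} f_i f_j' with k_{i,j} = kernel((i - j) / T).
    Assumes this is the form of the HAC estimator when the bandwidth equals T."""
    f = np.asarray(f, dtype=float)
    T = f.shape[0]
    idx = np.arange(T)
    K = kernel((idx[:, None] - idx[None, :]) / T)
    return f.T @ K @ f / T


def bartlett_partial_sum_form(f):
    """2 T^{-2} sum_{t=1}^{T-1} S_t S_t', where S_t is the t-th partial sum of f.
    Matches the Bartlett estimator above only if f sums to zero over the sample."""
    f = np.asarray(f, dtype=float)
    T = f.shape[0]
    S = np.cumsum(f, axis=0)[:-1]
    return 2.0 * S.T @ S / T**2


rng = np.random.default_rng(1)
f = rng.standard_normal((200, 2))
f = f - f.mean(axis=0)          # force the full-sample sum to be zero

S_kernel = hac_bandwidth_T(f)
S_partial = bartlett_partial_sum_form(f)
print(np.max(np.abs(S_kernel - S_partial)))
```

In an overidentified GMM model the moment function evaluated at the estimator does not sum to zero, so the partial-sum form here is an illustration of the algebra rather than a replacement for the kernel form.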

This approach to inference works for the general case in which it is desired
to test the hypothesis that the parameters satisfy a nonlinear set of restrictions,
$r(\theta_0) = 0$. However, it is more convenient to introduce this framework in the
context of the simple case in which $r(\theta_0) = \theta_{0,i}$. The more general case is then
covered at the end of the section. The arguments presented are heuristic and
the interested reader is referred to Vogelsang (2003) for a rigorous justification.

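Suppose then that it is desired to perform inference about $\theta_{0,i}$. The
natural starting point is the analogous statistic to the one appearing in (8.73),
namely

$$\frac{T^{1/2}(\hat\theta_{T,i} - \theta_{0,i})}{\sqrt{\hat V_{b_T=T,ii}}} \tag{8.75}$$

where $\hat V_{b_T=T,ii}$ is the $i$-$i$th element of

$$\hat V_{b_T=T} = [G_T(\hat\theta_T)'W_TG_T(\hat\theta_T)]^{-1} G_T(\hat\theta_T)'W_T\hat S_{b_T=T}W_TG_T(\hat\theta_T) [G_T(\hat\theta_T)'W_TG_T(\hat\theta_T)]^{-1} \tag{8.76}$$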

Since $\hat S_{b_T=T}$ is only used for inference and not estimation, the limiting behaviour
of the numerator is the same as before. However, this time it is useful to express
this limiting behaviour in terms of a Brownian motion.43 To this end, let $\iota_i$
be the $(p \times 1)$ selection vector whose $i$th element is one and whose remaining
elements are all zero. Using similar arguments to Section 3.4.2, it follows from
(3.26) that

$$T^{1/2}(\hat\theta_{T,i} - \theta_{0,i}) = \iota_i' T^{1/2}(\hat\theta_T - \theta_0) = -\iota_i'[G_0'WG_0]^{-1}G_0'W\,T^{1/2}g_T(\theta_0) + o_p(1) \tag{8.77}$$

43 See Section 5.4.2.1.

Equation (8.77) and the Functional Central Limit Theorem (Assumption 5.8)
together imply that

$$T^{1/2}(\hat\theta_{T,i} - \theta_{0,i}) \Rightarrow -\iota_i'[G_0'WG_0]^{-1}G_0'WS^{1/2}B_q(1) \tag{8.78}$$
$$= \sigma B_1(1) \tag{8.79}$$

where $B_q(r)$ is a $(q \times 1)$ Brownian motion and $\sigma$ is the (positive) square root of
$\iota_i'[G_0'WG_0]^{-1}G_0'WSWG_0[G_0'WG_0]^{-1}\iota_i$.

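Now consider the denominator of (8.75). Clearly the denominator depends
on $\hat S_{b_T=T}$. To develop the limiting behaviour of $\hat V_{b_T=T}$, we start with $\hat S_{b_T=T}$
and gradually add in the surrounding matrices that appear in (8.76). Vogelsang
(2003) shows that

$$\hat S_{b_T=T} = T^{-1}\sum_{\ell=1}^{T-1}\sum_{m=1}^{T-1} d_{m,\ell}\,Tg_m(\hat\theta_T)\,Tg_\ell(\hat\theta_T)' + \sum_{i=1}^{T} f(v_i,\hat\theta_T)\,k_{i,T}\,g_T(\hat\theta_T)' + T\sum_{\ell=1}^{T-1}(k_{T,\ell} - k_{T,\ell+1})\,g_T(\hat\theta_T)\,g_\ell(\hat\theta_T)' \tag{8.80}$$

where

$$d_{m,\ell} = (k_{m,\ell} - k_{m,\ell+1}) - (k_{m+1,\ell} - k_{m+1,\ell+1})$$

Now consider $C_T = G_T(\hat\theta_T)'W_T\hat S_{b_T=T}W_TG_T(\hat\theta_T)$. Since the first order conditions
(equation (3.12)) imply $G_T(\hat\theta_T)'W_Tg_T(\hat\theta_T) = 0$, it follows from (8.80)
that

$$C_T = T^{-1}\sum_{\ell=1}^{T-1}\sum_{m=1}^{T-1} d_{m,\ell}\,G_T(\hat\theta_T)'W_T\,Tg_m(\hat\theta_T)\,Tg_\ell(\hat\theta_T)'\,W_TG_T(\hat\theta_T) \tag{8.81}$$

To characterize the limiting behaviour of $C_T$, it is useful to introduce the step
function $D_T(r)$ defined on $r \in [0,1]$ as $D_T(r) = D(j/T)$ for $j/T \le r < (j+1)/T$,
$j = 1,2,\ldots,T-1$, where $D(x/T) = [\omega((x+1)/T) - \omega(x/T)] - [\omega(x/T) - \omega((x-1)/T)]$,
so that $d_{m,\ell} = -D((m-\ell)/T)$. Using this step function, $C_T$ can be rewritten as

$$C_T = -T^{-1}\sum_{\ell=1}^{T-1}\sum_{m=1}^{T-1} T^2D_T((m-\ell)/T)\,G_T(\hat\theta_T)'W_T\,g_m(\hat\theta_T)\,g_\ell(\hat\theta_T)'\,W_TG_T(\hat\theta_T)$$
$$= -\int_0^1\!\!\int_0^1 T^2D_T(r_1 - r_2)\,G_T(\hat\theta_T)'W_T\,T^{1/2}g_{[r_1T]}(\hat\theta_T)\;T^{1/2}g_{[r_2T]}(\hat\theta_T)'\,W_TG_T(\hat\theta_T)\,dr_1dr_2 \tag{8.82}$$

where $g_{[rT]}(\theta) = T^{-1}\sum_{t=1}^{[rT]} f(v_t, \theta)$. The advantage of this representation is
that $T^2D_T(r) \to \omega''(r)$ where $\omega''(\cdot)$ denotes the second derivative of $\omega(\cdot)$ on
$(-1,1)$.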

From (8.82), it is clear that the limiting behaviour of $C_T$ depends on
$G_T(\hat\theta_T)'W_TT^{1/2}g_{[rT]}(\hat\theta_T)$. Using similar arguments to the derivation of
Theorems 5.9-5.10 in Section 5.4.2.1, it can be shown that

$$G_T(\hat\theta_T)'W_TT^{1/2}g_{[rT]}(\hat\theta_T) \Rightarrow G_0'WS^{1/2}BB_q(r) \tag{8.83}$$

where $BB_q(r)$ denotes a $(q \times 1)$ Brownian Bridge.44 It follows from (8.82)-(8.83)
that

$$\iota_i'\hat V_{b_T=T}\,\iota_i \Rightarrow \sigma^2\int_0^1\!\!\int_0^1 -\omega''(r_1 - r_2)\,BB_1(r_1)BB_1(r_2)\,dr_1dr_2 \tag{8.84}$$

Combining (8.75), (8.79) and (8.84), it follows from the Continuous Mapping
Theorem that

$$\frac{T^{1/2}(\hat\theta_{T,i} - \theta_{0,i})}{\sqrt{\hat V_{b_T=T,ii}}} \Rightarrow \frac{B_1(1)}{\left\{\int_0^1\!\int_0^1 -\omega''(r_1 - r_2)\,BB_1(r_1)BB_1(r_2)\,dr_1dr_2\right\}^{1/2}} \tag{8.85}$$

A comparison of this distribution with the conventional limit distribution in
(8.73) indicates that the difference lies purely in the denominator.45 Since $B_1(1)$
and $BB_1(r)$ are independent by construction, it follows that the limiting
distribution in (8.85) is that of the ratio of two independent random variables. This
structure further implies that the limiting distribution is a mixture of normals.
Notice also that the distribution in (8.85) depends on the kernel. We return to
this issue below.
The more general hypothesis $r(\theta_0) = 0$ can be tested using the Wald-type
test,

$$W_T = T\,r(\hat\theta_T)'[R(\hat\theta_T)\hat V_{b_T=T}R(\hat\theta_T)']^{-1}r(\hat\theta_T)/s$$

where $R(\theta) = \partial r(\theta)/\partial\theta'$ and $r(\cdot)$ is $s \times 1$. The following lemma gives the limiting
distribution of this statistic under the null hypothesis. The necessary regularity
conditions pertain to the consistency of $\hat\theta_T$, $r(\cdot)$ and the behaviour of the partial
sums and derivatives. Since all three have been presented in Chapters 3 and 5,
we do not explicitly repeat them in the text here.


Lemma 8.2 Limiting Distribution of $W_T$ under $H_0: r(\theta_0) = 0$
If Assumptions 8.1-8.5, 8.7-8.10, 5.8, 5.7 and 5.8 hold then:

$$W_T \Rightarrow B_s(1)'\left[\int_0^1\!\!\int_0^1 -\omega''(r_1 - r_2)\,BB_s(r_1)BB_s(r_2)'\,dr_1dr_2\right]^{-1}B_s(1)/s$$


The limiting distribution in Lemma 8.2 depends only on the number of
restrictions, $s$, and the kernel. Kiefer and Vogelsang (2002a) show that the
Bartlett kernel has superior local power properties in a simpler setting, and so we
confine our discussion to this kernel here.46 With the Bartlett kernel, $\omega''(x) = 0$
for all $x \neq 0$, but the kernel has the drawback that $\omega(x)$ is not differentiable at $x = 0$.47
However, Kiefer and Vogelsang (2002b) show that $\omega''(0)$ can be replaced by $-2$
with the result that the limit distribution in Lemma 8.2 becomes:

$$B_s(1)'\left[2\int_0^1 BB_s(r)BB_s(r)'\,dr\right]^{-1}B_s(1)/s \tag{8.86}$$

44 See Definition 5.2 in Section 5.4.2.1.

45 Recall that $B_1(1) \sim N(0,1)$.

46 Kiefer and Vogelsang (2002a) consider the case in which the null hypothesis involves a
set of linear restrictions on the regression parameters in a linear model.

47 Recall that the Bartlett kernel is $\omega(x) = 1 - |x|$ for $x \in (-1,1)$ and zero elsewhere; see
Section 3.5.3.

Critical values for this distribution are given in Table 8.5 for $p \le 12$. Both
Kiefer and Vogelsang (2002a) and Vogelsang (2003) provide critical points for
$p \le 30$. Kiefer and Vogelsang (2002a) also provide critical points for a variety
of other kernels for the case of $p = 1$.
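Critical points for a distribution of this kind are usually obtained by simulation. The following is a rough sketch of how that might be done for the Bartlett-kernel limit, $B_s(1)'[2\int_0^1 BB_s(r)BB_s(r)'dr]^{-1}B_s(1)/s$, by approximating the Brownian motion with scaled partial sums of standard normal draws. The function name, the number of replications and the number of steps are arbitrary assumptions; the published tables were not produced with this code, and the simulated quantiles are subject to both Monte Carlo and discretization error.

```python
import numpy as np


def simulate_bartlett_wald_limit(s, reps=20000, steps=500, seed=0):
    """Draw from B_s(1)' [2 * int_0^1 BB_s(r) BB_s(r)' dr]^{-1} B_s(1) / s,
    approximating Brownian motion by scaled partial sums of N(0, 1) draws."""
    rng = np.random.default_rng(seed)
    r = np.arange(1, steps + 1) / steps
    draws = np.empty(reps)
    for i in range(reps):
        increments = rng.standard_normal((steps, s)) / np.sqrt(steps)
        B = np.cumsum(increments, axis=0)       # Brownian motion on the grid
        B1 = B[-1]                              # B_s(1)
        BB = B - r[:, None] * B1                # Brownian bridge B(r) - r B(1)
        M = 2.0 * (BB.T @ BB) / steps           # Riemann sum for 2 * integral
        draws[i] = B1 @ np.linalg.solve(M, B1) / s
    return draws


draws = simulate_bartlett_wald_limit(s=1)
q90, q95, q99 = np.quantile(draws, [0.90, 0.95, 0.99])
print(q90, q95, q99)
```

With one restriction, the simulated quantiles can be compared with the first row of Table 8.5; agreement should only be expected up to simulation error.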

Vogelsang (2003) evaluates the finite sample performance of this approach
via a small simulation study. The sample design involves the IV estimation of
the parameters of a linear regression model with errors that are generated by an
AR(p) process for p = 0, 1, 2. The null hypothesis of interest is that $\theta_{0,i} \le 0$
versus the alternative that $\theta_{0,i} > 0$, and so the test statistic is given in (8.75)
evaluated at $\theta_{0,i} = 0$. As a comparison, Vogelsang (2003) also considers the
performance of the conventional approach using (8.73) for the case in which
$\hat S_T$ is calculated using an HAC with a quadratic spectral kernel and bandwidth
selected by a method proposed in Andrews (1991). The evidence suggests that,
of the two, the test based on the HAC with $b_T = T$ exhibits an empirical size
that is closer to the nominal level and the asymptotic approximation is good at
T = 200. Size adjusted power calculations indicate that this ranking is reversed
under the alternative although the difference between the two tests is relatively
small.48


Table 8.5 
Critical points for Wr with the Bartlett kernel 

p 10% 5% 1% 

1 14.28 23.14 51.05 
2 17.99 26.19 48.74 
3 21.13 29.08 51.04 
4 24.24 32.42 52.39 
5 27.81 35.97 56.92 
6 30.36 38.81 60.81 
7 33.39 42.08 62.27 
8 36.08 45.32 67.14 
9 38.94 48.14 69.67 
10 41.71 50.75 72.05 
11 44.56 53.70 74.74 
12 47.27 56.70 78.80 


Source: Reprinted from Advances in Econometrics, 17, T.J. Vogelsang, “Testing in GMM 
models without truncation”, pp. 199-233, copyright (2003), with permission from Elsevier. 
Notes: the figures represent the critical points for the tests at the 10%, 5% and 1% significance 
levels. 


48 Vogelsang (2003) considers the case with and without pre-whitening and recolouring.

Example: Hansen and Singleton’s (1982) Consumption Based Asset
Pricing Model

It can be recalled that the underlying economic theory for this model implies
that $f(v_t, \theta_0)$ is a serially uncorrelated process, and so the majority of our
inferences have not involved the use of an HAC matrix to estimate the long
run variance. Nevertheless, for completeness, we use this approach to inference
to obtain alternative confidence intervals for the parameters in the model with
VWR. It follows from (8.85) and Table 8.5 that an approximate 95% confidence
interval for $\theta_{0,i}$ is given by

$$\hat\theta_{T,i} \pm \sqrt{23.14}\,\sqrt{\hat V_{b_T=T,ii}/T}$$

Using the iterated estimator based on $W_T = \hat S_{SU}^{-1}$, the interval for $\gamma_0$ is (-3.082,
4.415) and the interval for $\delta_0$ is (0.985, 1.026). Both are slightly wider than the
corresponding intervals reported in Table 3.8.49
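The interval formula in this example amounts to replacing the usual normal critical value with the square root of the 5% critical point from Table 8.5. A minimal sketch follows; the function name and the numerical inputs are hypothetical and are not the estimates behind the intervals reported above.

```python
import math


def interval_with_bandwidth_T(theta_hat_i, v_ii, T, critical_point=23.14):
    """Approximate 95% interval: theta_hat_i +/- sqrt(critical_point) * sqrt(v_ii / T),
    where v_ii is the i-th diagonal element of the variance estimate built from
    the HAC estimator with bandwidth T (Bartlett kernel, one restriction)."""
    half_width = math.sqrt(critical_point) * math.sqrt(v_ii / T)
    return theta_hat_i - half_width, theta_hat_i + half_width


# Hypothetical inputs, for illustration only.
lower, upper = interval_with_bandwidth_T(theta_hat_i=0.5, v_ii=4.0, T=100)
print(lower, upper)
```

Since the square root of 23.14 is roughly 4.81 rather than 1.96, these intervals are narrower than the conventional ones only when the variance estimate built from the bandwidth-T estimator is correspondingly smaller.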


8.4 Summary 


In this chapter, we review three alternative methods for approximating the
finite sample behaviour of the GMM estimator and its associated statistics.
These three are: (i) the bootstrap; (ii) an asymptotic theory developed for
the case in which the parameter vector is weakly identified by the population
moment condition; and (iii) an asymptotic theory designed to provide a better
approximation when the weighting matrix is based on a heteroscedasticity
autocorrelation covariance (HAC) matrix estimator. All three of these alternative
approximations are relatively new to the GMM literature, and so the associated
statistical theory is less comprehensive than that derived using the conventional
theoretical framework reviewed in Chapters 3 and 5. While important progress
has been made in each case, lacunae remain:


• The bootstrap: Since the bootstrap is based on resampling, it has the
potential to provide asymptotic refinements for all the inference procedures
discussed in Chapters 3 and 5. However, to date, these asymptotic
refinements have only been proven to occur within a class of nonlinear dynamic
models that includes some, but not all, of the types of model in Table 1.1.
The reliance on resampling means that the method can also be
computationally burdensome in nonlinear models, but this burden can be reduced
by using the approximate bootstrap. In the types of model in Table 1.1,
the data generation process is unknown and therefore the non-parametric
bootstrap must be used. With dynamic data, the non-parametric
bootstrap involves resampling blocks of data and, to date, there are no
definitive guidelines on how these blocks should be chosen in the settings that
arise in GMM estimation.

• Weak identification: There have been two main branches of this literature.
The first branch has focused on showing the sensitivity of the conventional
asymptotic approximation to the quality of the identification. This branch
of the theory provides an important caveat to the conventional asymptotic
analysis because weak identification is encountered in practice. The second
main branch of this literature focuses on the development of inference
techniques that are robust to weak identification. To date, such techniques
are only feasible in the context of nonlinear dynamic models for settings
in which the number of parameters is relatively small.

• HAC estimation with bandwidth equal to sample size: An attractive feature
of this approach is that it provides a simple, definitive rule for bandwidth
selection. Although the long run variance is not consistently estimated, it
is still possible to perform asymptotically valid inference about the
parameters. However, this choice of bandwidth cannot be used for estimation,
and, to date, there are no comparable asymptotically valid procedures for
inference based on the overidentifying restrictions.

49 See Section 3.6.


Chapters 3 through 8 have exposited the main aspects of the statistical the- 
ory of GMM estimation and its associated inference techniques. Throughout 
the discussion, the various techniques have been illustrated using one of the ex- 
amples from Section 1.3, namely the consumption based asset pricing model. In 
the following chapter, we consider the estimation of the remaining four examples 
from Section 1.3. 


9 


Empirical Examples 


Throughout the preceding chapters, the various facets of GMM estimation have
been illustrated using the consumption based asset pricing model described in 
Section 1.3.1. In this chapter, we present empirical analyses for the other four 
models described in Section 1.3. 

Section 9.1 implements the mutual fund evaluation measure proposed by
Chen and Knez (1996). This discussion illustrates the potential sensitivity of
inferences based on the overidentifying restrictions test to the choice of
covariance matrix estimator. We also present results based on a modified measure of
performance evaluation that involves a non-negativity constraint. As a
consequence, the choice of $f(v_t, \theta)$ does not satisfy the restrictions on the derivative
imposed by Assumption 3.5 because the derivative of the minimand is not
defined at all values of the parameter space. This non-existence causes problems
for gradient methods of optimization and so necessitates the use of an alternative
algorithm.

Section 9.2 explores whether the conditional capital asset pricing model can 
explain the variation across international stock price indices. The adequacy
of the specification is assessed using both the overidentifying restrictions test 
and also tests for structural stability. The analysis indicates that inferences 
about structural stability can be very sensitive to whether inference is based 
on the Wald, LM or D tests discussed in Section 5.4. One possible explanation 
is that the LM and D tests use the full sample GMM estimator instead of the 
restricted GMM estimator. To assess the impact of this substitution, alternative 
versions of the LM and D test are introduced. The evidence indicates that 
this substitution, while asymptotically valid, has a considerable impact on the 
behaviour of the tests in this example. 

Section 9.3 reports estimation results for Eichenbaum’s (1989) model for
inventory holdings, and examines whether the production smoothing or production
cost smoothing hypothesis best captures aggregate behaviour in non-durable
manufacturing industries. The analysis of the production smoothing model is
based on both the original moment condition derived in Section 1.3.3 and also
on two alternatives derived by applying curvature altering transformations to
the original. The iterated GMM estimates are seen to be sensitive to the
transformation employed, and these are contrasted to the continuous updating GMM
estimates that are insensitive to such transformations.

Section 9.4 explores whether a stochastic volatility model can capture the 
time series properties of the daily U.S. dollar - Canadian dollar exchange rate. 
This is another example in which the moment condition does not satisfy the 
conditions placed on the derivative matrix by Assumption 3.5. This time the 
problem is due to the presence of the absolute value function. In this con- 
text, this problem has been treated by using a polynomial approximation in the 
neighbourhood of the point at which the derivative is not defined, and we exam- 
ine the sensitivity of the estimation results to the width of this neighbourhood. 
In this example, the parameter vector is heavily overidentified, and so we ex- 
amine whether any of the moment conditions are redundant using the moment 
selection criteria described in Chapter 7. 


9.1 Mutual Fund Performance Evaluation 


Section 1.3.2 describes a method for the evaluation of mutual fund performance
proposed by Chen and Knez (1996). It can be recalled that a fund receives a
zero performance measure if

$$\lambda(r_t^m, d_t) = E[r_t^m X_t'\delta_0] - 1 = 0 \tag{9.1}$$

where $r_t^m$ is the payoff on the mutual fund, $d_t$ is the stochastic discount factor
and $X_t$ is a $(N \times 1)$ vector of payoffs on the traded assets included in the
benchmark set. A fund receives a positive performance measure if $\lambda(r_t^m, d_t) > 0$.
To distinguish this measure from another that is discussed below, we follow Chen
and Knez (1996) and refer to $\lambda(r_t^m, d_t)$ as the LOP measure where the acronym
stands for “Law of One Price”, the theorem from which the measure is deduced.
As in Section 1.3.2, we re-express this condition for a zero evaluation as

$$E[Q_t X_t'\delta_0] - 1_{N+1} = 0 \tag{9.2}$$

where $Q_t = (X_t', r_t^m)'$, $\delta_0$ is a $(N \times 1)$ parameter vector defined in Section 1.3.2
and $1_{N+1}$ is a $((N+1) \times 1)$ vector of ones. It can be recognized that (9.2)
constitutes a set of $N+1$ population moment conditions in $N$ parameters and so it is
possible to test the null hypothesis of a zero evaluation using the overidentifying
restrictions test described in Section 5.1. Notice that the alternative for this test
statistic is that $E[Q_t X_t'\delta_0] - 1_{N+1} \neq 0$ and so is broader than $\lambda(r_t^m, d_t) > 0$.
Therefore, a significant statistic provides evidence against a zero performance
evaluation but does not necessarily provide evidence of positive performance.

Inspection of (9.2) reveals that it is linear in the parameters. It therefore
follows by similar arguments to Section 2.1 that $\delta_0$ is globally identified
if rank$\{E[Q_t X_t']\} = N$, the number of parameters in the notation here. Given
the structure of $Q_t$, a sufficient condition for identification is therefore that
$E[X_t X_t']$ is nonsingular, which might reasonably be anticipated to hold in the
absence of any obvious redundancies in the definition of the benchmark set. The
linear structure can also be exploited to deduce a closed form solution for the
GMM estimator of $\delta_0$. Using similar arguments to Section 2.2, it can be shown
that

$$\hat\delta_T = (M_T'W_TM_T)^{-1}M_T'W_T1_{N+1} \tag{9.3}$$

where $M_T = T^{-1}\sum_{t=1}^{T} Q_t X_t'$.
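The closed form in (9.3) is straightforward to compute. The sketch below uses simulated payoffs purely for illustration (the Chen and Knez data are not reproduced here), takes the identity as the weighting matrix unless another is supplied, and checks that the first order conditions $M_T'W_T(M_T\hat\delta_T - 1_{N+1}) = 0$ hold at the computed estimate. The function name and the simulated design are assumptions.

```python
import numpy as np


def lop_gmm_estimator(X, r_m, W=None):
    """Closed-form GMM estimator for E[Q_t X_t' delta] - 1 = 0 with Q_t = (X_t', r_t^m)'.
    X is T x N (benchmark payoffs), r_m has length T (fund payoff)."""
    X = np.asarray(X, dtype=float)
    r_m = np.asarray(r_m, dtype=float)
    T, N = X.shape
    Q = np.column_stack([X, r_m])            # T x (N + 1)
    M = Q.T @ X / T                          # (N + 1) x N sample analogue of E[Q_t X_t']
    if W is None:
        W = np.eye(N + 1)                    # identity weighting matrix
    ones = np.ones(N + 1)
    delta_hat = np.linalg.solve(M.T @ W @ M, M.T @ W @ ones)
    g_hat = M @ delta_hat - ones             # sample moment evaluated at the estimate
    return delta_hat, g_hat, M, W


# Simulated gross returns, for illustration only.
rng = np.random.default_rng(2)
T, N = 264, 3
X = 1.0 + 0.1 * rng.standard_normal((T, N))
r_m = X.mean(axis=1) + 0.05 * rng.standard_normal(T)

delta_hat, g_hat, M, W = lop_gmm_estimator(X, r_m)
foc = M.T @ W @ g_hat                        # should be numerically zero
print(delta_hat, g_hat, foc)
```

Because there are N + 1 moment conditions and N parameters, the sample moment `g_hat` is generally non-zero at the estimate; it is this discrepancy that the overidentifying restrictions test assesses.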

Chen and Knez (1996) evaluate the performance of sixty eight funds both
individually and in the aggregate. For the aggregate analysis, funds are grouped
according to their investment objective and then the group “average” return is
constructed as the return on an equally weighted portfolio of the funds in the
group. Chen and Knez (1996) consider five investment objectives: growth (G),
income-growth (IG), income (I), stability-growth-income (SGI), and maximum
capital gain (MC). In this section, we evaluate fund performance for these group
averages. For the benchmark set, Chen and Knez (1996) use the risk free rate
and twelve industry based portfolios.1 The data used here are the same as in Chen
and Knez’s (1996) study and constitute monthly returns for the period January
1968 through December 1989.2 This gives a total of T = 264 observations.

To implement the estimation, Chen and Knez (1996) estimate the long run
variance using an HAC estimator with the Bartlett kernel and a bandwidth
$b_T = 17$ and base inference on the two step estimator.3 Therefore, we begin
our analysis with this configuration and then consider the impact of using the
iterated estimator and also using two alternative covariance matrix estimators.
Since there is no theoretical reason to set $b_T = 17$, we also report results using
an HAC estimator with a Bartlett kernel and the bandwidth selected via Newey
and West’s (1994) data-based method. Finally, we consider the sensitivity of
the results to the use of “prewhitening and recolouring” by reporting results
based on the prewhitened and recoloured covariance matrix estimator in (3.58).
In all calculations, the first step weighting matrix is set equal to the identity matrix.4

Table 9.1 reports the results for the group averages. The top line of the 
table represents the configuration used by Chen and Knez (1996) and the re- 
sults are close to those reported in their Table 1. The slight differences likely 
reflect differences in estimation routine. This evidence fails to reject 
the null of a zero performance evaluation at the 5% level in every case, although 
there is evidence against the null at the 10% level for the stability-growth-income 
group. It can be seen that iteration has no qualitative impact on this conclusion 
—even though convergence required between ten and fifteen steps in each case.’ 
However, the results are far more sensitive to the choice of covariance matrix 
estimator. If the bandwidth is estimated from the data then the chosen value 
is always zero or one. This, in turn, leads to an increase in the test statistics 
in every case. In two cases, the income and stability-growth-income groups, 
the tests now provide evidence against zero performance at the 5% level. 
Interestingly, if the prewhitened and recoloured estimator in (3.58) is used then 
the statistics fall between those obtained by fixing and estimating the bandwidth. 
With this choice of covariance matrix estimator, the null of a zero performance 
evaluation is not rejected at the 5% level for any group, although only marginally 
for the income and stability-growth-income groups. 

1 See Chen and Knez (1996) for further details. 
2 I am extremely grateful to Peter Knez for providing me with this data. 
3 See Section 3.5.3 for a discussion of HAC estimators. 
4 Chen and Knez (1996) do not report which weighting matrix they used on the first step. 
5 Although a closed form solution exists, this is only exploited in the first step and 
estimates on subsequent steps are actually obtained using a numerical optimization routine. 
The convergence criterion for the numerical optimization is set at $10^{-6}$; see Section 3.2. 
6 The convergence criterion for the iteration is implemented with a tolerance of $10^{-8}$ 
and $I_{max} = 20$; see Section 3.6. 


Table 9.1 
LOP measures of average mutual fund performance 

                              Investment objective 
Stat.                  G        IG       I        SGI      MC 

$\hat{S}_T = \hat{S}_{HAC}(17)$ 
J (two step)           1.870    1.095    2.233    2.846    2.460 
p-value                0.172    0.295    0.135    0.092    0.117 
J (iterated)           1.860    1.109    2.195    2.674    2.513 
p-value                0.173    0.292    0.139    0.102    0.113 

$\hat{S}_T = \hat{S}_{HAC}(b)$ 
J (two step)           2.478    1.524    4.944    4.042    2.579 
p-value                0.116    0.217    0.026    0.044    0.108 
J (iterated)           2.410    1.586    4.824    4.084    2.546 
p-value                0.121    0.208    0.028    0.043    0.111 

$\hat{S}_T$ from (3.58) 
J (two step)           2.170    1.250    3.588    3.431    2.267 
p-value                0.141    0.264    0.058    0.064    0.132 
J (iterated)           2.147    1.261    3.331    3.213    2.268 
p-value                0.143    0.261    0.068    0.073    0.132 

Notes: All HAC estimators are calculated with the Bartlett kernel. $\hat{S}_T = \hat{S}_{HAC}(17)$ 
denotes the case in which $\hat{S}_T = \hat{S}_{HAC}$ in (3.54) and $b_T = 17$; $\hat{S}_T = \hat{S}_{HAC}(b)$ 
denotes the case in which $\hat{S}_T = \hat{S}_{HAC}$ and $b_T$ is selected using Newey and West's (1994) 
method; the third panel uses the prewhitened and recoloured estimator given in (3.58); 
J (two step) and J (iterated) are the overidentifying restrictions test in (5.2) based on the 
two step and iterated estimators respectively; p-value is the p-value of the overidentifying 
restrictions test on the line above. 


Clearly the evidence is sensitive to the choice of covariance matrix estimator. 
Unfortunately, no guidance is available regarding the performance of these es- 
timators in this type of setting and so it is impossible to know which version 
of the test is more reliable here. Even allowing for the sensitivity to the choice 
of covariance matrix estimator, the results do not provide compelling evidence 
against a zero performance evaluation. As remarked by Chen and Knez (1996), 
such a conclusion might not be considered particularly surprising given the 
aggregate nature of the group return. However, it is also possible that the choice 
of measure may understate performance. While it is true that the LOP measure 
is zero if the mutual fund does not enlarge the investment opportunity set for 
uninformed investors, it is also possible for the LOP measure to be zero even 
though the opportunity set has been expanded in the sense that some investor 
prefers to hold the fund over any constant composition portfolio based on $X_t$. 
This concern motivates the introduction of the modified evaluation measure that 
we now consider. 

This modified measure rests on the assumption that the securities market 
satisfies a no-arbitrage condition, that is, all securities with a positive pay-off 
have a positive price.7 If this condition is satisfied then the stochastic 
discount factor is strictly positive and $X_t$ can be priced by $d_t^+ = (X_t'\delta_0)^+$ where 
$(X_t'\delta_0)^+ = \max\{X_t'\delta_0, 0\}$, so that (1.24) is replaced by 

$$E[X_t d_t^+] = 1_n$$

A similar modification is made in the evaluation measure to yield 

$$E[r_t^f (X_t'\delta_0)^+] - 1 = 0$$

Chen and Knez (1996) refer to the quantity on the left-hand side as the NA-measure, 
with the acronym standing for "no-arbitrage". Chen and Knez (1996) show that the NA 
measure is only zero if the fund does not expand the investment opportunity set and 
that it is positive if there is at least one investor who would prefer to hold the 
fund rather than any constant composition portfolio constructed from $X_t$. 

In principle, it is possible to test zero performance using the NA measure in 
the same way as before. The only difference is that the overidentifying 
restrictions test is now based on the population moment condition 

$$E[Q_t (X_t'\delta_0)^+] - 1_{n+1} = 0 \qquad (9.4)$$

However, this difference raises an important issue. The population moment 
condition in (9.4) involves the function $(X_t'\delta)^+$, which is not differentiable with 
respect to $\delta$ at $X_t'\delta = 0$ and so does not satisfy Assumption 3.5, one of the 
regularity conditions for our asymptotic distribution theory. However, there are 
grounds for anticipating that Theorem 5.1 still holds in this case; see Hansen, 
Heaton, and Luttmer (1995). Therefore, we follow Chen and Knez (1996) and 
proceed under the assumption that this extension is possible. 

7 See Ingersoll (1987) [Chapter 2]. 

The functional form in (9.4) is far more complicated than in the LOP case 
due to the presence of the non-negative operator $(X_t'\delta)^+$. Experimentation 
with different starting values revealed that this type of nonlinearity creates 
problems for fminu, the gradient method in MATLAB. In fact, it is noted in 
the User's Guide to the Optimization Toolbox that fminu does not work well 
if the function is discontinuous. Although the minimand is continuous here, 
its derivative does not exist at $X_t'\delta = 0$ and hence is not a continuous function. 
Therefore, the estimations are performed using an alternative routine, fmins, 
within the Optimization Toolbox that employs a simplex search method. This 
algorithm is less efficient than fminu but does not require evaluation of the 
gradient of the minimand; see Mathworks (2000) for further details. 
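The appeal of a derivative-free routine can be illustrated on a toy problem. The sketch below is not the MATLAB routine used for the results reported here; it is a minimal Nelder-Mead simplex search, written only for illustration, applied to a scalar objective that contains the positive-part operator. The data `X`, the function names and the tuning constants are all assumptions.

```python
def nelder_mead(f, x0, step=0.5, tol=1e-10, max_iter=5000):
    """Minimal Nelder-Mead simplex search (illustrative, not production code)."""
    n = len(x0)
    simplex = [list(x0)]
    for i in range(n):
        vertex = list(x0)
        vertex[i] += step
        simplex.append(vertex)

    for _ in range(max_iter):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        # stop once the simplex has collapsed to (almost) a single point
        size = max(abs(a - b) for v in simplex[1:] for a, b in zip(v, best))
        if size < tol:
            break
        centroid = [sum(v[i] for v in simplex[:-1]) / n for i in range(n)]

        def along(t):
            # point on the line through the centroid and the worst vertex
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        if f(reflected) < f(best):
            expanded = along(2.0)
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = along(-0.5)
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # shrink every vertex halfway towards the best one
                simplex = [best] + [
                    [b + 0.5 * (x - b) for x, b in zip(v, best)]
                    for v in simplex[1:]
                ]
    return min(simplex, key=f)


# Hypothetical payoffs with mixed signs, so max(x * d, 0) has a kink at d = 0.
X = [2.0, -1.0, 3.0, -0.5]


def objective(delta):
    # squared pricing error of a one-parameter positive-part discount factor
    d = delta[0]
    priced = sum(max(x * d, 0.0) for x in X) / len(X)
    return (priced - 1.0) ** 2


if __name__ == "__main__":
    print(nelder_mead(objective, [0.5]))
```

The search only compares function values, so the kink in the objective causes no difficulty; the price is that many more function evaluations are typically needed than for a gradient method on a smooth problem.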

Further experimentation using fmins indicated that the minimum on the 
first step estimation is located in the neighbourhood of the LOP solution given 
by (9.3). Therefore, this value is used as the starting value for the first step 
estimation. The convergence criterion for the numerical optimization is 
$10^{-6}$. For the iterated estimation, the convergence criterion is once again 
$10^{-6}$ but the maximum number of steps is increased to $I_{max} = 100$. In every 
case, the numerical optimization failed to converge after 100,000 iterations on 
the first few steps of the iterated estimation. Experimentation indicated that 
increasing the number of iterations made no difference either to the likelihood 
of convergence or the value of the minimand at the end. Nevertheless, after these 
initial steps convergence did occur on each step and the iterated estimation did 
itself converge. Therefore, all calculations are performed with a maximum of 
100,000 iterations on each step. However, this pattern of behaviour means 
that the reported values for the overidentifying restrictions test on the second 
step should be regarded as upper bounds on these statistics. 

The results are given in Table 9.2. Once again the first row of the table 
replicates the configuration reported in Chen and Knez (1996) and we start our 
discussion with this case. Our results are once again slightly different from those 
reported in Chen and Knez (1996) but qualitatively similar. In comparison to 
the LOP measure, the NA measure leads to larger values for the overidentifying 
restrictions test in every case although none of the tests are significant at the 
5% level. Once again, iteration tends to reduce the value of the test statistic. 
However, as with the LOP measure, the results are very sensitive to the choice 
of covariance matrix estimator. When the bandwidth is estimated from the data 
it is invariably zero or one, and this leads to statistics that are significant at the 
5% level for both the income and stability-growth-income groups. The use of 
prewhitening and recolouring leads to statistics that are lower but nevertheless 
still marginally significant at the 5% level for the income and 
stability-growth-income groups. 

In their study, Chen and Knez (1996) also report distributions of p-values 
from applying these tests to individual funds. Although we do not replicate this 
part of their analysis here, it is worth briefly noting what they found. Of the 
sixty eight funds considered, they report that 8% of the funds provide evidence 
against zero performance at the 5% significance level using the LOP measure 
and that this percentage increases to 13.2% when the NA measure is used. It 
is unclear precisely how to interpret such percentages because even if all the 
tests are independent and the null is true in every case then we would expect to 
reject the null 5 per cent of the time. However, if this evidence is taken at face 
value then there would appear to be evidence against zero performance using 
these measures. At this point, it is worth reassessing what the measure actually 
captures. It can be recalled that these measures compare fund performance to 
a benchmark set of passively held portfolios. Chen and Knez (1996) argue that 
this may be too low a benchmark since financial information is reported in the 
media. They therefore propose a "conditional" measure in which the benchmark 
consists of portfolios whose weights can vary in response to publicly available 
information. This conditional measure can also be implemented using GMM 
and the type of testing strategy employed above; see Chen and Knez (1996). 
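The difficulty of interpreting the rejection percentages for individual funds can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that the sixty eight tests are independent and that the null is true for every fund, so that the number of rejections at the 5% level is Binomial(68, 0.05). The threshold of nine funds is also an assumption: it is 13.2% of 68 rounded to the nearest whole fund. Independence is unlikely to hold exactly for funds exposed to common shocks.

```python
from math import comb


def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


if __name__ == "__main__":
    n_funds, size = 68, 0.05
    # expected number of rejections if every null is true
    print(n_funds * size)
    # chance of nine or more rejections under the same assumptions
    print(binom_tail(n_funds, size, 9))
```

Comparing the observed number of rejections with this reference distribution gives a rough sense of whether the reported percentages exceed what chance alone would produce.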


Table 9.2 
NA measures of average mutual fund performance 

                              Investment objective 
Stat.                  G        IG       I        SGI      MC 

$\hat{S}_T = \hat{S}_{HAC}(17)$ 
J (two step)           2.135    1.494    2.501    3.306    2.869 
p-value                0.144    0.222    0.114    0.069    0.090 
J (iterated)           1.919    1.179    2.211    2.781    2.628 
p-value                0.166    0.278    0.137    0.095    0.105 

$\hat{S}_T = \hat{S}_{HAC}(b)$ 
J (two step)           2.862    1.879    5.497    4.720    3.180 
p-value                0.091    0.171    0.019    0.030    0.075 
J (iterated)           2.692    1.838    4.765    4.757    3.135 
p-value                0.101    0.175    0.029    0.029    0.077 

$\hat{S}_T$ from (3.58) 
J (two step)           2.772    1.656    4.232    4.604    3.061 
p-value                0.096    0.198    0.040    0.032    0.080 
J (iterated)           2.649    1.634    4.040    4.124    2.968 
p-value                0.104    0.201    0.043    0.042    0.085 

Notes: see Table 9.1. 


9.2 Conditional Capital Asset Pricing Model 


Section 1.3.3 describes the conditional capital asset pricing model (CCAPM). 
This model has been used to investigate the pricing of a wide variety of assets. 
In this section, we follow Harvey (1991) and investigate whether the model can 
explain the variation in the returns across international stock markets. 

It can be recalled from Section 1.3.3 that the model implies a set of 
population moment conditions involving the conditional first two moments of the 
asset returns. It is convenient to express these moment conditions more 
compactly here. To this end, we set $\theta_{i,0} = (\delta_{i,0}', \delta_{m,0}')'$, $u_{i,t}(\delta_i) = r_{i,t} - z_{t-1}'\delta_i$ and 
$u_{m,t}(\delta_m) = r_{m,t} - z_{t-1}'\delta_m$ where (as a reminder) $r_{i,t}$ denotes the excess return 
on holding the market portfolio for country $i$, $r_{m,t}$ is the excess return from 
holding the "world market" portfolio, and $z_{t-1}$ denotes a vector of relevant 
economic and financial variables contained in the information set $\Omega_{t-1}$. The 
population moment conditions in (1.34) and (1.36) can be written as 

$$E[f_i(v_t, \theta_{i,0})] = E[a_{i,t}(\theta_{i,0}) \otimes z_{t-1}] = 0 \qquad (9.5)$$

where 

$$a_{i,t}(\theta_i) = \begin{bmatrix} u_{i,t}(\delta_i) \\ u_{m,t}(\delta_m) \\ u_{m,t}(\delta_m)^2 z_{t-1}'\delta_i - u_{m,t}(\delta_m) u_{i,t}(\delta_i) z_{t-1}'\delta_m \end{bmatrix} \qquad (9.6)$$

If the model is correct then these moment conditions hold simultaneously for all 
countries. Harvey reports results based on (9.5) for both individual countries 
and groups of countries. For brevity here, we only consider GMM estimation 
on a country by country basis and so $\theta_{i,0}$ is estimated based on (9.5) for each 
country $i$. 

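Given this approach to the estimation, the condition for local identification 
of $\theta_{i,0}$ is that $\mathrm{rank}\{E[\partial f_i(v_t, \theta_{i,0})/\partial\theta_i']\} = p$ where, in this case, $p = 2n_z$ and 
$n_z$ denotes the dimension of $z_{t-1}$. For this model, the derivative matrix is as 
follows, 

$$\frac{\partial f_i(v_t, \theta_i)}{\partial \theta_i'} = A_{i,t}(\theta_i) \otimes z_{t-1} z_{t-1}' \qquad (9.7)$$

where 

$$A_{i,t}(\theta_i) = \begin{bmatrix} -1 & 0 \\ 0 & -1 \\ A_{i,t}^{(1)} & A_{i,t}^{(2)} \end{bmatrix}$$

and 

$$A_{i,t}^{(1)} = u_{m,t}(\delta_m)^2 + u_{m,t}(\delta_m) z_{t-1}'\delta_m$$
$$A_{i,t}^{(2)} = -2u_{m,t}(\delta_m) z_{t-1}'\delta_i - u_{m,t}(\delta_m) u_{i,t}(\delta_i) + u_{i,t}(\delta_i) z_{t-1}'\delta_m$$

Using (1.34), it follows from (9.7) that 

$$E\left[\frac{\partial f_i(v_t, \theta_{i,0})}{\partial \theta_i'}\right] = E[\bar{A}_{i,t} \otimes z_{t-1} z_{t-1}'] \qquad (9.8)$$

where 

$$\bar{A}_{i,t} = \begin{bmatrix} -1 & 0 \\ 0 & -1 \\ u_{m,t}(\delta_{m,0})^2 & -u_{i,t}(\delta_{i,0}) u_{m,t}(\delta_{m,0}) \end{bmatrix}$$

It can easily be recognized that the matrix in (9.8) has rank $p$ provided $E[z_{t-1} z_{t-1}']$ 
is nonsingular. The latter condition holds as long as there are no linear 
redundancies among the information variables. 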

As mentioned above, we present here results from estimating the model for 
individual countries. It should be noted that this approach does not impose all 
the restrictions of the model. The underlying theory implies that (9.5) holds 
for all $i$ at the same value for $\delta_{m,0}$. However, the latter restriction is ignored 
when the estimation is on a country by country basis. Therefore, as noted 
by Harvey, some caution must be exercised in interpreting the results. If the 
model is not rejected for individual countries then this does not necessarily mean 
that the model can simultaneously explain the variation in returns across these 
countries. On the other hand, if the model is rejected for a particular country 
then that provides valuable information about the failings of the underlying 
theory. 

The data are the same as those used in Harvey's study.8 The observations 
are monthly and span the period 1970:02 to 1989:05; this gives a total of 232 
observations. The world market portfolio is the Morgan Stanley Capital 
International index (MSCI) and represents a weighted combination of the returns 
on a variety of world-wide investments; see Harvey (1991) for specific details. 
Both the world return, $r_{m,t}$, and the country $i$ return, $r_{i,t}$, are expressed in U.S. 
dollars in excess of the holding period return on the T-bill that is closest to 30 
days to maturity. The information variables are denoted by $z_{t-1}$ above. The 
vector $z_t$ contains: a constant; $r_{m,t}$; a dummy variable for the month of January; 
the U.S. term structure premia, calculated as the return for holding a 90-day 
U.S. T-bill for one month less the return from holding a 30-day T-bill; the U.S. 
default risk spread, calculated as the yield on a Moody's Baa rated bond less the 
yield on a Moody's Aaa rated bond; the dividend yield on the Standard and 
Poor's 500 stock index less the return on a 30-day U.S. T-bill. Given this choice 
of information variables, the model yields q = 18 moment conditions and p = 12 
parameters. Harvey (1991) reports results for seventeen countries.9 However, 
for brevity, we restrict attention to the G-7 countries. 

With regard to the specifics of the GMM estimation, Harvey (1991) uses a 
first step weighting matrix proportional to the identity matrix, and estimates 
the long run variance by $\hat{S}_{SU}$. Notice that the latter is consistent because 
$a_{i,t}(\theta_{i,0})$ is a martingale difference given $\Omega_{t-1}$, provided the model is correctly 
specified. Therefore, we use these weighting matrices as well but also consider 
the sensitivity of the results to the use of $(T^{-1}\sum_{t=1}^{T} I_3 \otimes z_t z_t')^{-1}$ as the first 
step weighting matrix. In each case, the starting values for $\delta_i$ are the least 
squares estimates from the regression of $r_{i,t}$ on $z_{t-1}$, and those for $\delta_m$ are the 
corresponding estimates only with $r_{m,t}$ as the dependent variable.10 

It can be seen from Table 9.3 that the results are relatively insensitive to the 
choice of first step weighting matrix, and that in each case iterated estimation 
yields identical statistics. However, the qualitative conclusion can be sensitive to 
iteration. For both the U.S. and Japan, the overidentifying restrictions test statistic 
is insignificant at the 10% level after two steps but becomes significant after 
iteration. It is the results based on the iterated statistics that replicate those 
reported in Harvey (1991). Overall, there is evidence in favour of the model for 
Canada, France, Germany, Italy and the U.K. and evidence against the model 
for Japan and the U.S. 

8 I am grateful to Eric Ghysels for providing me with the data. 

9 These countries are: Australia, Austria, Belgium, Canada, Denmark, France, Germany, 
Hong Kong, Italy, Japan, The Netherlands, Norway, Spain, Sweden, Switzerland, the United 
Kingdom, and the United States. 

10 Notice that these estimates are the GMM estimates based on (1.34) alone. 


Table 9.3 
Overidentifying restrictions tests for the conditional 
capital asset pricing model 


Case A Case B 

Country Two step Iterated Two step Iterated 
Canada 3.669 3.156 3.145 3.156 
0.721 0.789 0.790 0.789 

France 8.316 10.308 7.861 10.308 
0.216 0.112 0.248 0.112 

Germany 3.183 3.476 3.229 3.476 
0.785 0.747 0.780 0.747 

Italy 9.538 9.821 8.337 9.821 
0.146 0.132 0.214 0.132 

Japan 8.461 14.984 10.428 14.984 
0.206 0.020 0.108 0.020 

U.K. 1.083 1.104 1.094 1.104 
0.982 0.981 0.982 0.981 

U.S. 7.450 10.764 7.499 10.764 
0.277 0.096 0.281 0.096 


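Notes: Case A denotes a first step weighting matrix proportional to $I_{18}$, and Case B 
denotes the first step weighting matrix $(T^{-1}\sum_{t=1}^{T} I_3 \otimes z_t z_t')^{-1}$. 
The numbers below the test statistics are the associated p-values. 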
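The p-values in Table 9.3 can be checked from the counts given earlier: with q = 18 moment conditions and p = 12 parameters, the overidentifying restrictions test has q - p = 6 degrees of freedom. The sketch below uses the closed form of the chi-square survival function that is available for even degrees of freedom; the function name `chi2_sf_even` is hypothetical.

```python
from math import exp


def chi2_sf_even(x, df):
    """Survival function of a chi-square variable with even degrees of freedom.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires positive even df")
    half = x / 2.0
    term, total = 1.0, 0.0
    for k in range(df // 2):
        total += term
        term *= half / (k + 1)
    return exp(-half) * total


if __name__ == "__main__":
    n_z = 6                       # number of information variables
    q, p = 3 * n_z, 2 * n_z       # moment conditions and parameters
    # statistics taken from Table 9.3 (Canada two step Case A, Japan iterated)
    for country, j_stat in [("Canada", 3.669), ("Japan", 14.984)]:
        print(country, round(chi2_sf_even(j_stat, q - p), 3))
```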


It can be recalled that the innovation of the CCAPM is to allow the 
investment betas to vary over time. It is therefore natural to question whether the 
assumed form of this variation is appropriate. If variation is present but the assumed 
model is incorrect, then it would be anticipated that this would manifest itself 
in structural instability. While the overidentifying restrictions test provides a 
general diagnostic for misspecification, it is not specifically designed to test for 
structural instability. Furthermore, it can be recalled from Section 5.4 that 
the overidentifying restrictions test can have size equal to power against certain 
types of structural instability. Motivated by these concerns, Ghysels (1998) 
argues that it is important to submit the CCAPM to formal tests of structural 
stability. He pursued the issue in the context of CCAPMs for domestic U.S. 
assets and found widespread evidence of instability. Therefore, we now consider 
if similar evidence is present here. 

Since there is no reason to associate any instability with a particular 
moment in time, the analysis is based on the unknown break point versions of 
the tests described in Section 5.4.2. It can be recalled from this earlier 
discussion that the construction of these tests involves the calculation of the "known 
break point" tests for all possible break points within an interval $\Pi$. 
Conventional practice is to set $\Pi = [0.15, 0.85]$ and we followed this rule here in the 
absence of any alternative guidance. However, it is worth considering what this 
rule actually implies here. For $\pi = 0.15$, the first sub-sample involves only 
thirty-five observations. This means that the sub-sample estimation attempts 
to retrieve twelve parameters from eighteen moment conditions based on just 
35 observations! As can readily be imagined, under such circumstances, 
convergence can be very sensitive to the sample. In this example, it turns out 
that the numerical optimization runs into problems when $\pi = 0.15$ but not 
when $\pi = 0.85$ (in which case the second sub-sample consists of thirty-five 
observations). The problems stem from the near singularity of $\hat{S}_{1,T}(\pi)$. Since this 
occurred for every country, this particular break point is dropped from the 
calculations and so the tests are calculated for break points 36, 37, ..., 197 
or, equivalently, 1973.02 through 1986.07. Convergence also proves a 
problem for sub-samples if the maximum allowable number of iterations, $I_{max}$, is 
set too high. As a result, the estimations are performed with $I_{max} = 6$, that 
is, a six-step estimator. All calculations use the first step weighting matrix 
$(T^{-1}\sum_{t=1}^{T} I_3 \otimes z_t z_t')^{-1}$. 
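The sub-sample sizes quoted above follow directly from the sample size and the interval of break fractions. The sketch below reproduces the arithmetic under the assumption that a break fraction is converted into a break point by rounding the product of the fraction and T to the nearest integer; the exact convention of Section 5.4.2 is not restated here, and the function name is hypothetical.

```python
def break_point_grid(T, lower, upper):
    """Candidate break points for break fractions in [lower, upper].

    Assumption: the break point for a fraction is round(fraction * T).
    """
    first = round(lower * T)
    last = round(upper * T)
    return list(range(first, last + 1))


if __name__ == "__main__":
    T = 232
    grid = break_point_grid(T, 0.15, 0.85)
    # sizes of the first sub-sample at the lower end and the second at the upper end
    print(grid[0], T - grid[-1])
    # drop the first break point, as described in the text
    usable = grid[1:]
    print(usable[0], usable[-1], len(usable))
```

With twelve parameters and eighteen moment conditions, the smallest sub-samples therefore contain fewer than three observations per parameter.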

Table 9.4 reports the Sup-, Av- and Exp- versions of the tests for both 
parameter variation and also stability of the overidentifying restrictions.11 It 
can be recalled from Section 5.4 that the three tests of parameter variation are 
asymptotically equivalent under both the null hypothesis (i.e. no parameter 
variation) and local alternatives. However, the three statistics exhibit very 
diverse behaviour here. The LM versions of the test are insignificant at the 10% 
level in every case, the Wald versions are considerably larger and significant at 
the 1% level in every case, and the D versions are orders of magnitude larger 
still.12 To date there is no guidance available regarding which version of these 
tests is more reliable in finite samples. However, one possible explanation for 
this discrepancy is that the LM and D tests are calculated using the full sample 
GMM estimator rather than the restricted estimator. While the two estimators 
are asymptotically equivalent under the null and local alternatives, it may be 
that the sample size here is too small for this equivalence to apply. To explore 
this possibility further, the test statistics are also calculated using the following 
versions of the LM and D tests based on the restricted estimator $\tilde{\theta}_T(\pi)$, 


$$LM_T^R(\pi) = \sum_{i=1}^{2} T_i\, d_{i,T}(\tilde{\theta}_T(\pi); \pi)'\, V_{i,T}(\pi)\, d_{i,T}(\tilde{\theta}_T(\pi); \pi) \qquad (9.9)$$

$$D_T^R(\pi) = T[J(\tilde{\theta}_T(\pi), \tilde{\theta}_T(\pi); \pi) - J(\hat{\theta}_{1,T}(\pi), \hat{\theta}_{2,T}(\pi); \pi)] \qquad (9.10)$$

where 

$$d_{i,T}(\tilde{\theta}_T(\pi); \pi) = G_{i,T}(\tilde{\theta}_T(\pi); \pi)'\, \hat{S}_{i,T}(\pi)^{-1}\, g_{i,T}(\tilde{\theta}_T(\pi); \pi)$$
$$V_{i,T}(\pi) = [G_{i,T}(\tilde{\theta}_T(\pi); \pi)'\, \hat{S}_{i,T}(\pi)^{-1}\, G_{i,T}(\tilde{\theta}_T(\pi); \pi)]^{-1}$$

and $\hat{S}_{i,T}(\pi)$ denotes a consistent estimator of $S_i$ based on $\tilde{\theta}_T(\pi)$; see Section 
5.4.1 for further definitions. Both these statistics are asymptotically equivalent 
under the null and local alternatives to the three tests for parameter variation 
discussed in Section 5.4.1. Interestingly, for the example here, it can be seen from 
Table 9.4 that the tests based on $LM_T^R(\pi)$ and $D_T^R(\pi)$ yield statistics that are 
closer to the Wald based tests than their counterparts based on the full sample 
estimator. Nevertheless, substantial differences remain, although in nearly every 
case the tests based on $W_T(\pi)$, $LM_T^R(\pi)$ and $D_T^R(\pi)$ yield qualitatively the same 
conclusion. 

11 Critical points for these statistics are given in Tables 5.5 and 5.6. 

12 The extremely large values of the D statistic occur when one of the sub-samples is small, 
because in these cases it turns out here that the estimator of the long run variance in the 
small sub-sample is ill-conditioned and so close to singularity. 


The overidentifying restrictions based tests tend to be insignificant. There- 
fore, if these results are taken at face value then collectively they suggest that 
the misspecification is due to parameter variation. However, this would seem 
to be a big “if”. As mentioned above, the discrepancies in the tests raise suspi- 
cions about the adequacy of the asymptotic approximation here. Furthermore, 
we note that many of the Sup- tests yield estimated break points close to the 
beginning or end of the sample, and it can be recalled that this may also be 
an indicator that asymptotic theory is not a good approximation. Research is 
currently in progress to investigate the reliability of these tests in the types of 
setting considered here. 
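The way in which a sequence of known break point statistics is collapsed into the Sup-, Av- and Exp- statistics can be sketched as follows. The functional forms are the standard ones (maximum, average, and the logarithm of the average of exp(0.5 x statistic)) and are assumed here to match the definitions in Section 5.4.2; the function names and the illustrative sequences are hypothetical. The sketch also shows why a direct evaluation of the Exp- functional cannot be represented in floating point when one statistic is extremely large, and how factoring out the largest term avoids this.

```python
from math import exp, log


def sup_stat(stats):
    return max(stats)


def av_stat(stats):
    return sum(stats) / len(stats)


def exp_stat_naive(stats):
    # direct evaluation; math.exp raises OverflowError for very large inputs
    return log(sum(exp(0.5 * s) for s in stats) / len(stats))


def exp_stat_stable(stats):
    # log-sum-exp form: factor out the largest term before exponentiating
    m = 0.5 * max(stats)
    return m + log(sum(exp(0.5 * s - m) for s in stats) / len(stats))


if __name__ == "__main__":
    # hypothetical sequence over 162 break points with one dominant value
    stats = [271.293] + [10.0] * 161
    print(sup_stat(stats), av_stat(stats), exp_stat_stable(stats))

    huge = [4.0e9] + [10.0] * 161
    try:
        exp_stat_naive(huge)
    except OverflowError:
        print("direct evaluation of the Exp- functional overflows")
    print(exp_stat_stable(huge))
```

When a single break point dominates, the Exp- statistic is approximately half the Sup- statistic minus the logarithm of the number of break points, which is the pattern visible in several rows of Table 9.4.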


Due to these concerns about the adequacy of the asymptotic approximation, 
we do not pursue this example further. However, it is worth briefly summarising 
Ghysels’s (1998) conclusions regarding the ability of the CCAPM to explain the 
prices of domestic U.S. assets. His sample consists of monthly data from 1927:01 
through 1988:01 and thus contains more than three times as many observations 
as the sample in our example above. Inference about structural stability is based 
on the Sup-LM test, using the statistic in (5.77) with $\Pi = [0.2, 0.8]$. The 
evidence indicates that the model is validated by the overidentifying restrictions 
test in many cases but that there is substantial evidence of misspecification due 
to neglected parameter variation. Interestingly, Ghysels (1998) reports that in 
many cases the unconditional version of the capital asset pricing model provides 
more accurate forecasts of asset prices than its conditional counterpart. These 
results highlight the importance of not relying purely on the overidentifying 
restrictions test to assess the adequacy of the specification. 

Table 9.4 
Structural stability tests for the conditional capital 
asset pricing model 

Country    Statistic    Sup          Date       Av           Exp 
Canada     W            53.196       1985.08    31.444       22.404 
           LM           23.151       1977.11    14.636       9.817 
           LM (9.9)     43.276       1977.01    27.467       18.895 
           D            4.0x10°      1973.07    3162.427     ∞ 
           D (9.10)     271.293      1973.05    42.895       130.559 
           O            22.219       1983.12    13.324       8.239 
France     W            248.733      1973.09    36.225       119.279 
           LM           25.036       1980.09    16.162       10.054 
           LM (9.9)     56.713       1973.10    24.450       23.363 
           D            5.7x 10      1973.10    3.5x10^8     ∞ 
           D (9.10)     2.0x10^4     1974.01    303.526      ∞ 
           O            30.667       1983.07    17.254       11.980 
Germany    W            116.670      1986.04    36.640       53.726 
           LM           19.791       1982.11    12.532       7.155 
           LM (9.9)     40.085       1975.04    23.798       15.938 
           D            7.8x10^7     1973.09    7.0x10^5     ∞ 
           D (9.10)     1.4x10^4     1973.12    294.689      ∞ 
           O            19.821       1974.01    11.388       7.199 
Italy      W            144.147      1986.02    37.277       66.986 
           LM           18.466       1978.03    11.881       6.848 
           LM (9.9)     43.198       1976.05    21.689       18.239 
           D            7.0x10^7     1986.07    5.2x10^5     ∞ 
           D (9.10)     3.9x10^5     1973.07    3775.701     ∞ 
           O            23.598       1984.12    17.364       9.700 
Japan      W            117.877      1973.02    33.620       54.157 
           LM           18.715       1977.01    12.429       7.251 
           LM (9.9)     44.201       1973.05    24.508       17.215 
           D            2.8x10^10    1986.02    1.7x10^8     ∞ 
           D (9.10)     3.6x10^5     1973.05    2408.333     ∞ 
           O            30.142       1974.10    22.460       12.521 
U.K.       W            64.166       1973.02    21.053       26.996 
           LM           18.566       1981.12    11.594       6.704 
           LM (9.9)     37.351       1986.07    20.611       14.801 
           D            6.9x 10"     1986.04    6.6 x 10°    ∞ 
           D (9.10)     1.0x10^4     1986.04    129.555      ∞ 
           O            21.528       1985.06    9.983        7.957 
U.S.       W            121.367      1973.05    33.069       55.611 
           LM           17.225       1982.11    11.185       6.556 
           LM (9.9)     32.128       1976.10    20.865       13.328 
           D            1.2x10°      1986.02    7.3x10^8     ∞ 
           D (9.10)     4.9x10^4     1986.02    350.100      ∞ 
           O            27.926       1986.07    16.748       10.084 

Notes: W denotes the versions of the statistics based on the Wald test in (5.75), LM denotes 
the versions of the statistics based on the LM test in (5.77), LM (9.9) denotes the versions of 
the statistics based on the LM test in (9.9), D denotes versions of the statistics based on the D 
test in (5.78), D (9.10) denotes versions of the statistics based on the D test in (9.10), O denotes 
versions of the tests based on the statistic in (5.80). ∞ denotes results too large to represent 
as conventional floating-point values. 


9.3 Inventory Holdings by Firms 


Section 1.3.4 describes Eichenbaum's (1989) model for inventory holdings by 
firms. The key innovation is that the model provides a framework for 
determining whether the production smoothing hypothesis or the production cost 
smoothing hypothesis better captures firm behaviour. In this section, we 
examine which hypothesis, if either, captures aggregate inventory holding behaviour 
in non-durable manufacturing industries for the U.S. 

It can be recalled from Section 1.3.4 that the difference between the two 
hypotheses rests on the presence or absence of the stochastic shock $\nu_t$ in the 
cost function. If $\nu_t = 0$ for all $t$ then stochastic cost shocks are absent and so the 
only incentive for holding inventories is the desire to smooth production levels. 
However, if $\nu_t \neq 0$ then stochastic cost shocks are present and this produces an 
incentive to hold inventories to smooth both levels and costs. For simplicity, 
we use the term "production smoothing version of the model" to refer to the 
case in which $\nu_t = 0$, and the term "production cost smoothing version of the 
model" to refer to the case in which $\nu_t \neq 0$. 

Eichenbaum (1989) shows that the production smoothing version of the 
model implies the population moment condition, 

$$E[z_t h_{t+1}(\psi_0)] = 0 \qquad (9.11)$$

where as a reminder 

$$h_{t+1}(\psi_0) = I_{t+1} - \{\lambda_0 + (\beta_0\lambda_0)^{-1}\} I_t + \beta_0^{-1} I_{t-1} + S_{t+1} - \phi_0\beta_0^{-1} S_t \qquad (9.12)$$

where $\psi_0 = (\lambda_0, \phi_0, \beta_0)'$ and $z_t \in \Omega_t$. The production cost smoothing version of 
the model implies the population moment condition 

$$E[\{h_{t+2}(\psi_0) - \rho_0 h_{t+1}(\psi_0)\} z_t] = 0 \qquad (9.13)$$

It is clear from (9.11) and (9.13) that the production smoothing and production 
cost smoothing hypotheses imply different restrictions on the data. To 
emphasize an important aspect of this difference, it is useful to compare the two sets 
of moment conditions in the case where $\rho_0 = 0$. From (9.11), it can be seen that 
the production smoothing version of the model implies that $h_{t+1}(\psi_0)$ is 
orthogonal to all elements of the information set $\Omega_t$. In contrast, from (9.13) (lagged 
one period), it can be seen that the production cost smoothing version implies 
only that $h_{t+1}(\psi_0)$ is orthogonal to any element of the information set $\Omega_{t-1}$. 
Furthermore, Eichenbaum (1989) shows that if the production cost smoothing 
version of the model is correct then $h_{t+1}(\psi_0)$ is not orthogonal to any element 
of the information set $\Omega_t$. This key difference indicates a way of discriminating 
between the two versions of the model: if the production smoothing version 
is correct then (9.11) is valid, but if the production cost smoothing version is 
correct then (9.11) is invalid but (9.13) is valid. Therefore, the overidentifying 
restrictions tests associated with these two sets of moment conditions can reveal 
which, if either, of the two competing hypotheses is correct. 

The primary focus of such inventory models is to estimate the speed of 
adjustment of actual inventories to the target (or desired) level of inventories. 
Eichenbaum (1989) shows that the speed of adjustment is $1 - \lambda_0$ within either 
version of the model. This means that firms adjust inventories toward their 
target level at $(1 - \lambda_0)100\%$ per month. This interpretation also means that 
$\lambda_0$ must lie between zero and one if it is to make economic sense. Economic 
theory also implies a restriction on $\phi_0$. It can be recalled that $\phi_0 = 1 - \delta_0\gamma_0/\alpha_0$ 
where $(\delta_0, \gamma_0, \alpha_0)$ are parameters of the cost function. Given their roles in the 
cost function, all three of these parameters would be positive if the model is 
correctly specified. This, in turn, translates into the restriction that $\phi_0 < 1$. 
One final aspect of the parameterization needs to be noted. To simplify the 
estimation, it is customary to fix the value of the discount factor $\beta = 0.995$ a 
priori, and we follow this practice here, as did Eichenbaum (1989). 
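The interpretation of the speed of adjustment can be illustrated numerically. The sketch below assumes a fixed target level and adjustment at rate $(1 - \lambda)$ per month, so that the share of an initial gap between actual and target inventories remaining after k months is $\lambda^k$; the value of $\lambda$ and the function name are hypothetical and are not estimates from the model.

```python
def adjustment_path(lam, months):
    """Share of an initial inventory gap remaining after each month.

    Assumptions: the target level is held fixed and a fraction (1 - lam)
    of the remaining gap is closed every month.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between zero and one")
    return [lam ** k for k in range(months + 1)]


if __name__ == "__main__":
    lam = 0.9                                  # hypothetical value
    print(round((1 - lam) * 100, 6))           # percent adjustment per month
    print([round(x, 4) for x in adjustment_path(lam, 3)])
```

A value of $\lambda$ outside the unit interval has no such interpretation, which is why the function rejects it.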

Eichenbaum (1989) estimates both versions of the model using aggregated 
data for all non-durable manufacturing industries in the U.S. as well as 
aggregated data for six specific industries using monthly data for 1959:1 through 
1984:12.13 Here we restrict attention to the aggregate data for all non-durable 
manufacturing industries, and use a revised and enlarged data set spanning 
1959:1 through 1998:5, yielding a sample of 473 observations. The data are 
compiled by the Bureau of Economic Analysis (BEA) of the US Department of 
Commerce.14 The series represent end of the month inventories and sales of 
finished goods. The data are adjusted to constant chained 1992 dollars and are 
seasonally adjusted. 

One immediate problem is that these data on inventories and sales are not 
stationary because they trend over time. Eichenbaum (1989) presents results 
based on detrending the data via either first differencing or using a quadratic 


13 The six industries in question are: tobacco, rubber, food, petroleum, chemicals and 
apparel. 
14 I am grateful to David Doorn for providing me with the data. 
time trend. Although there is not complete unanimity in the literature on the
appropriate method, these data are most commonly detrended using a quadratic
time trend and so we use that method here.^15
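
As a rough illustration of quadratic detrending, the sketch below regresses a series on a constant, $t$ and $t^2$ by least squares and keeps the residuals. The function name `detrend_quadratic` and the artificial series are assumptions made for the example; the BEA inventory and sales data are not reproduced here.

```python
import numpy as np

def detrend_quadratic(series):
    """Return residuals from an OLS regression of `series` on (1, t, t^2)."""
    y = np.asarray(series, dtype=float)
    t = np.arange(1, y.size + 1, dtype=float)
    X = np.column_stack([np.ones_like(t), t, t ** 2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

# Artificial example: 473 monthly observations, matching the sample length
# quoted in the text. The trend coefficients are made up for illustration.
t = np.arange(1, 474, dtype=float)
trend_only = 5.0 + 0.3 * t + 0.002 * t ** 2
residuals = detrend_quadratic(trend_only)
```

Because the artificial series is an exact quadratic in $t$, the residuals are zero up to rounding error; with real data the residuals would be the detrended series used in estimation.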

We first consider estimation of the production smoothing version of the
model based on (9.11). Following Eichenbaum (1989), we set

$$z_t = (1, I_t, S_t, I_{t-1}, S_{t-1})'$$

and so there are $q = 5$ moment conditions in $p = 2$ unknown parameters. The
first step weighting matrix is set equal to the inverse of the instrument cross
product matrix, $(T^{-1}\sum_{t=1}^{T} z_t z_t')^{-1}$. Within this model, $h_{t+1}(\psi_0)$ is a martingale
difference with respect to $\Omega_t$ and so the long run variance can be consistently
estimated by $\hat{S}_{SU}$.
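
The ingredients just described can be sketched numerically. The code below is only an illustration under stated assumptions: the function names are hypothetical, the long run variance estimator is taken to be the uncentred outer product $T^{-1}\sum_t f_t f_t'$ with $f_t = z_t h_{t+1}$, and the data are random numbers rather than the inventory series.

```python
import numpy as np

def first_step_weight(Z):
    """W_T = (T^{-1} sum_t z_t z_t')^{-1}; Z is T x q with rows z_t'."""
    T = Z.shape[0]
    return np.linalg.inv(Z.T @ Z / T)

def s_su(Z, h):
    """T^{-1} sum_t f_t f_t' with f_t = z_t * h_t (assumes no serial correlation)."""
    F = Z * h[:, None]          # row t holds f_t'
    return F.T @ F / Z.shape[0]

def minimand(Z, h, W):
    """Q_T = g_T' W g_T, where g_T = T^{-1} sum_t z_t h_t."""
    g = Z.T @ h / Z.shape[0]
    return float(g @ W @ g)

# Random placeholder data: q = 5 instruments (constant plus four regressors).
rng = np.random.default_rng(0)
T = 50
Z = np.column_stack([np.ones(T), rng.normal(size=(T, 4))])
h = rng.normal(size=T)          # stands in for the residual h_{t+1}(psi)

W = first_step_weight(Z)
S = s_su(Z, h)
Q = minimand(Z, h, W)
```

In an actual estimation `h` would be recomputed from the Euler equation residual at each trial parameter value, and `Q` would be passed to a numerical optimizer.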

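The condition for local identification of $\theta_0 = (\lambda_0, \phi_0)'$ is that the rank of
$E[\partial f(v_t, \theta_0)/\partial\theta']$ equal two. For this model, the derivative matrix is given by

$$E[\partial f(v_t, \theta_0)/\partial\theta'] = E[z_t \tilde{x}_t'] M(\theta_0) \qquad (9.14)$$

where $\tilde{x}_t = (I_{t+1}, S_t)'$ and

$$M(\theta_0) = \begin{bmatrix} (0.995\lambda_0^2)^{-1} - 1 & 0 \\ 0 & -(0.995)^{-1} \end{bmatrix}$$

As discussed in Section 3.1,

$$\mathrm{rank}\{E[\partial f(v_t, \theta_0)/\partial\theta']\} \le \min\{\mathrm{rank}(E[z_t \tilde{x}_t']), \mathrm{rank}(M(\theta_0))\}$$

and so a necessary condition for identification is that both $E[z_t \tilde{x}_t']$ and $M(\theta_0)$
have rank equal to two. Inspection of $M(\theta_0)$ indicates that this matrix is of
rank two. However, it is impossible to say anything regarding $E[z_t \tilde{x}_t']$ a priori.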
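
The rank inequality can be illustrated with a small numerical example. The matrices below are hypothetical stand-ins chosen for the illustration, not quantities computed from the model or the data: the cross-moment matrix is deliberately built with collinear columns, so the product loses rank even though the second factor is nonsingular.

```python
import numpy as np

# Hypothetical 5 x 2 cross-moment matrix whose second column is exactly
# twice the first, so its rank is one.
cross_moment = np.outer(np.arange(1.0, 6.0), np.array([1.0, 2.0]))

# An arbitrary nonsingular 2 x 2 matrix standing in for M(theta_0).
M = np.array([[0.3, 0.0],
              [0.0, -1.0 / 0.995]])

rank_cross = np.linalg.matrix_rank(cross_moment)
rank_M = np.linalg.matrix_rank(M)
rank_product = np.linalg.matrix_rank(cross_moment @ M)
```

Here `rank_M` is two but `rank_product` is only one, which is why full rank of $M(\theta_0)$ alone is necessary but not sufficient for identification. In practice one can only inspect the sample analogue of the cross-moment matrix.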

Table 9.5 reports results for three different starting values, $(\lambda, \phi) = (0.5, 0.5)$,
$(0.9, 0.9)$, $(1.5, 1.0)$, using a convergence criterion of $10^{-6}$. It can be seen that in
each case the first step estimates are the same to three decimal places. However,
the estimates are not identical to a higher precision and this explains why the
iterated estimates are different. In spite of these differences, all the estimates
for $\lambda$ exceed one and all the overidentifying restrictions statistics are significant
at the 1% level, both of which are indicative of misspecification.

Instead of proceeding to the production cost smoothing version of the model, 
we first explore some alternative approaches to estimation of the production 
smoothing model based on scaled versions of the moment condition. To motivate 
these alternative approaches, it is useful to consider a plot of the first step 
minimand associated with estimation based on the original moment condition 
in (9.11). As can be seen in Figure 9.1, this first step minimand is very flat in
the area of the starting values.^16 Of most concern is the fact that the minimand
is very flat in the dimension of $\lambda$, the parameter of most interest. So for this
data, it would appear that the population moment condition in (9.11) is not 


15 See Doorn (2003) for further discussion. 
16 Parenthetically, we note that this flatness is exhibited by the minimand on subsequent 
steps and explains the sensitivity of the iterated estimates. 
Table 9.5
GMM estimates of the production smoothing model based on
$c(\theta_0) E[f(v_t, \theta_0)] = 0$

                                                                 Starting values, $(\lambda, \phi)$
$c(\theta_0)$        Statistic                                   (0.5, 0.5)        (0.9, 0.9)        (1.5, 1.0)

1                    $(\hat\lambda_T^{(1)}, \hat\phi_T^{(1)})$   (1.003, 0.947)    (1.003, 0.947)    (1.003, 0.947)
                     $TQ_T(\hat\theta_T^{(1)})$                  53362295.59       53362295.59       53362295.59
                     $(\hat\lambda_T^{(2)}, \hat\phi_T^{(2)})$   (1.003, 0.945)    (1.003, 0.945)    (1.003, 0.945)
                     $J_T^{(2)}$                                 26.151*           26.151*           26.151*
                     $(\hat\lambda_T^{(.)}, \hat\phi_T^{(.)})$   (1.135, 0.937)    (1.003, 0.945)    (1.003, 0.945)
                     $J_T^{(.)}$                                 25.218*           26.154*           26.154*

$(1-\lambda)^{-1}$   $(\hat\lambda_T^{(1)}, \hat\phi_T^{(1)})$   (0.699, 0.901)    diverges          (1.550, 0.880)
                     $TQ_T(\hat\theta_T^{(1)})$                  1.0 x 10^9                          4.4 x 10^8
                     $(\hat\lambda_T^{(2)}, \hat\phi_T^{(2)})$   (0.740, 0.907)                      (1.427, 0.894)
                     $J_T^{(2)}$                                 42.597*                             60.879*
                     $(\hat\lambda_T^{(.)}, \hat\phi_T^{(.)})$   (0.744, 0.905)                      (1.407, 0.894)
                     $J_T^{(.)}$                                 34.229*                             41.460*

$-0.995\lambda$      $(\hat\lambda_T^{(1)}, \hat\phi_T^{(1)})$   (0.791, 0.927)    (0.791, 0.927)    (0.790, 0.927)
                     $TQ_T(\hat\theta_T^{(1)})$                  37526635.32       37526635.32       37526635.32
                     $(\hat\lambda_T^{(2)}, \hat\phi_T^{(2)})$   (0.815, 0.926)    (0.815, 0.926)    (0.815, 0.926)
                     $J_T^{(2)}$                                 27.118*           27.118*           27.118*
                     $(\hat\lambda_T^{(.)}, \hat\phi_T^{(.)})$   (0.818, 0.926)    (0.818, 0.926)    (0.818, 0.926)
                     $J_T^{(.)}$                                 25.993*           25.993*           25.993*

Notes: $(\hat\lambda_T^{(i)}, \hat\phi_T^{(i)})$ denote the GMM estimators on the $i$th step, $J_T^{(i)}$ denotes the overidentifying
restrictions test on the $i$th step; $i = .$ denotes the iterated estimator, * denotes significance at
the 1% level.


particularly informative about the parameters. A similar qualitative conclusion
has been drawn by researchers using other aggregate inventory data, and this
has motivated an interest in scaling the Euler equation residual, $h_{t+1}(\psi_0)$, in
an attempt to provide a moment condition that is more informative about $\lambda_0$
over the range of economically meaningful values. A number of scalings have
been used; here we consider two. Schuh (1996) bases estimation on the scaled
moment condition^17

$$(1 - \lambda_0)^{-1} E[z_t h_{t+1}(\psi_0)] = 0 \qquad (9.15)$$


17 It should be noted that Schuh (1996) uses this transformed moment condition to estimate
the model using establishment level data and not the aggregate data used here. 
Figure 9.1: First step minimand for the production smoothing model, original
version


Durlauf and Maccini (1995) base estimation on the scaled moment condition 


$$-\beta\lambda_0 E[z_t h_{t+1}(\psi_0)] = 0 \qquad (9.16)$$


Both these transformations are examples of a “curvature altering transformation 
of the population moment condition” that is discussed in Section 3.7. From 
this discussion, it can be recalled that such transformations alter the first order 
conditions in overidentified models and so, in turn, alter the estimates in general. 
However, the estimator is still consistent. We now consider the effects of each 
of these transformations upon our results. 

First, consider the case in which the moment condition is multiplied by
$(1-\lambda_0)^{-1}$. Figure 9.2 contains a plot of the first step minimand associated
with estimation based on (9.15), again using $W_T = (T^{-1}\sum_{t=1}^{T} z_t z_t')^{-1}$. It can
be seen that the transformation serves to create a ridge at $\lambda = 1$ so that there
is now a boundary around the economically meaningful range of values for $\lambda$.^18
Nevertheless, the minimand still appears to be very flat within this region. It
should also be noted that the transformation has created a function with at
least two minima: one associated with a value of $\lambda$ in the economically relevant
range and one associated with a value of $\lambda$ greater than one. One potential numerical
disadvantage of this transformation is that if $\lambda = 1$ then $(1 - \lambda)^{-1}$ is infinite.


18 Since the minimand is infinite at $\lambda = 1$, the plot is truncated over the range $\lambda \in$
(0.95, 1.05).
Figure 9.2: First step minimand for the production smoothing model, scaled
by $(1-\lambda)^{-1}$
Figure 9.3: First step minimand for the production smoothing model, scaled
by $-\beta\lambda$




To circumvent this problem, $(1-\lambda)^{-1}$ is computed as $(1-\lambda+\mathrm{eps})^{-1}$ where $\mathrm{eps} =
2.2204 \times 10^{-16}$ and represents the floating point relative accuracy in MATLAB
calculations. Estimation results are reported in Table 9.5 using the same starting
values as before. It can be seen that this time the first step estimates are sensitive
to the starting values. If the starting values are (0.5, 0.5) then the estimates of $\lambda$
are less than one. If the starting values are (0.9, 0.9), and hence close to the ridge,
then the estimates diverge on the first step with the result that $\hat{S}_{SU}$ is singular.
If the starting values are (1.5, 1.0) then the estimate of $\lambda$ is greater than one.
It is the local minimum associated with $\hat\lambda_T > 1$ that is the smaller of the two.
However, this does not necessarily mean that it is these estimates that should be
chosen. The underlying economic model implies that there should be a value of
$\theta_0$ that both satisfies the population moment condition and is also economically
meaningful. This does not preclude the possibility of other minima outside the
economically relevant part of the parameter space. Furthermore, the asymptotic
distribution theory only requires local identification. Therefore, if there is a
single well-determined minimum within the economically meaningful part of
the parameter space, then it is reasonable to adopt the estimates associated
with that minimum. The minimum associated with $\hat\lambda_T < 1$ would appear to
satisfy the criteria above. However, even then, the point estimate for $\lambda_0$ implies
that inventories are adjusting towards the desired level at about 30% a month
which is considered to be implausibly low. The overidentifying restrictions tests
also indicate misspecification. Interestingly, the values for these statistics are
substantially larger than with the previous model.
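
The numerical guard on $(1-\lambda)^{-1}$ can be mimicked in a few lines. This is a minimal sketch rather than the code used for the reported results: the function name `guarded_inverse` is hypothetical, and Python's `sys.float_info.epsilon` is used because it is the double precision machine epsilon, the same quantity as MATLAB's `eps`.

```python
import sys

EPS = sys.float_info.epsilon   # 2.220446049250313e-16 for IEEE double precision

def guarded_inverse(lam, eps=EPS):
    """Compute (1 - lam + eps)^{-1}, which stays finite at lam = 1."""
    return 1.0 / (1.0 - lam + eps)

at_ridge = guarded_inverse(1.0)      # equals 1 / eps, large but finite
away_from_ridge = guarded_inverse(0.5)
```

The guard only removes the singularity at exactly $\lambda = 1$. The scaling factor is still of order $10^{15}$ there, so the ridge in the minimand remains, and a trial value equal to `1.0 + EPS` would still produce a division by zero in this sketch.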

Now consider the case in which the moment condition is multiplied by $-\beta\lambda_0$.
Figure 9.3 contains a plot of the first step minimand associated with estimation
based on (9.16), again using $W_T = (T^{-1}\sum_{t=1}^{T} z_t z_t')^{-1}$. It can be seen that this
transformation has created more curvature in the minimand. While it is not
clear exactly where the minimum lies, it certainly looks more clearly defined.
This is borne out by the estimation results: the estimates are actually identical
to six decimal places on each of the steps reported. As can be seen from Table
9.5, the estimates of $\lambda_0$ are all less than one, but once again the implied speed of
adjustment is implausibly low. The overidentifying restrictions tests are again
significant at the 1% level.

It is evident that the estimates reported above are sensitive to the choice of 
transformation. Unfortunately, there is no obvious way to choose between them. 
It can be recalled that this sensitivity is one motivation for using the continuous 
updating GMM estimator. Figure 9.4 plots the minimand of the continuous up- 
dating GMM estimator. It can be seen that this function exhibits considerably 
more curvature. Figure 9.5 rotates this plot to reveal the valley more clearly 
to the eye; note that the axes are therefore different from the previous plots.
Given the potential instabilities of the continuous updating GMM minimand,^19
the estimation is implemented using all the iterated estimates reported above
as starting values. The results are presented in Table 9.6. Interestingly, the
different starting values lead to two different sets of estimates, one with $\hat\lambda_T > 1$


19 See Section 3.7. 
Figure 9.5: Continuous updating GMM minimand, with 90° rotation

Table 9.6
Continuous updating GMM estimates of the
production smoothing model

Starting values, $(\lambda, \phi)$                     $(\hat\lambda_T, \hat\phi_T)$    $J_T$
(0.818, 0.926), (1.407, 0.894)                         (0.865, 0.935)                   25.134*
(0.744, 0.905), (1.003, 0.945), (1.135, 0.937)         (1.161, 0.935)                   25.134*

Notes: $J_T$ denotes the overidentifying restrictions test, * denotes significance at the 1% level.


and one with $\hat\lambda_T < 1$. The associated overidentifying restrictions statistics are
actually identical to ten decimal places. Although the estimates differ slightly
from those above, the overall conclusion is the same. Taking the minimum
associated with $\hat\lambda_T < 1$, the implied speed of adjustment is implausibly low and
the overidentifying restrictions test is significant at the 1% level.
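
The reason the continuous updating estimator is insensitive to the scalings considered above can be seen in a small numerical sketch. The assumptions are labelled: the function names are hypothetical, the variance estimator is the uncentred outer product of the scaled moment functions, the data are random numbers, and the scaling $c$ is evaluated at a single fixed trial parameter value.

```python
import numpy as np

def cu_minimand(h, Z, c=1.0):
    """T * g' S^{-1} g for moments f_t = c * z_t * h_t, with S built from the same f_t."""
    T = Z.shape[0]
    F = c * Z * h[:, None]
    g = F.mean(axis=0)
    S = F.T @ F / T
    return float(T * g @ np.linalg.solve(S, g))

def fixed_weight_minimand(h, Z, W, c=1.0):
    """T * g' W g with a weighting matrix that does not depend on the scaling."""
    T = Z.shape[0]
    g = (c * Z * h[:, None]).mean(axis=0)
    return float(T * g @ W @ g)

rng = np.random.default_rng(1)
T = 200
Z = rng.normal(size=(T, 3))
h = rng.normal(size=T)
W = np.linalg.inv(Z.T @ Z / T)
c = -0.995 * 0.8                  # e.g. the scaling -beta*lambda at lambda = 0.8

cu_unscaled = cu_minimand(h, Z)
cu_scaled = cu_minimand(h, Z, c)
fw_unscaled = fixed_weight_minimand(h, Z, W)
fw_scaled = fixed_weight_minimand(h, Z, W, c)
```

Scaling multiplies the sample moment by $c$ and the variance estimator by $c^2$, so the continuous updating minimand is unchanged at every parameter value, whereas the fixed weight minimand is multiplied by $c^2$. Since $c$ varies with $\lambda$, the latter changes the location of the minimum.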

While the estimates differ across the various estimations of the production 
smoothing model, the basic conclusion is the same: the model is rejected with 
this data. Eichenbaum (1989) reports a similar finding. We therefore now 
consider the production cost smoothing version of the model, and concentrate 
exclusively on the original version of the population moment condition in (9.13). 

The condition for local identification is derived for a special case of the
model in Section 3.1, and since the condition is not particularly instructive, we
do not pursue it further here. The estimation is implemented using the same
instrument vector as before but there are now $q = 5$ moment conditions in $p = 3$
parameters due to the introduction of $\rho$. The first step weighting matrix is once again
$(T^{-1}\sum_{t=1}^{T} z_t z_t')^{-1}$. Eichenbaum (1989) shows that the long run variance can
be consistently estimated by $\hat{S}_{SU}$ within this model as well, and so we use this
estimator here.

Experimentation with a variety of starting values indicates that the first step
minimand possesses multiple minima. As above, there is one involving $\lambda < 1$
and one with $\lambda > 1$. Since there appears to be only one minimum in the economically
meaningful area of the parameter space, we focus attention on the results
associated with this minimum. It can be seen from Table 9.7 that these
estimates provide evidence in favour of the specification. The implied speed of
adjustment is approximately 70%, $\hat\phi_T < 1$ and the overidentifying restrictions
tests are insignificant at the 10% level (although only just in one case). Table
9.7 also contains the results from a continuous updating GMM estimation using
the iterated estimates as starting values. The only important difference is that
the continuous updating estimator of $\phi_0$ is -0.933 as opposed to the iterated
GMM estimates of approximately -0.2. Therefore all the results are consistent
with Eichenbaum's (1989) finding that the production cost smoothing model is
not rejected using aggregate non-durables industry data.

Although our analysis is confined to aggregate non-durable industry data, 
it is worth noting that Eichenbaum (1989) reports a similar pattern of results 
for the six industries considered in his study. Durlauf and Maccini (1995) also
find the production smoothing hypothesis is rejected using aggregate data, and 
report evidence supportive of a variant of the production cost smoothing model. 
While the production smoothing hypothesis is rejected using aggregate indus- 
try data, there is evidence that this may be an artefact of aggregation. Using 
an establishment level data set, Schuh (1996) finds that the speeds of adjust- 
ment within the production smoothing model are an order of magnitude higher 
at establishment level than their counterparts estimated using industry data 
constructed by aggregating the establishment data. 


Table 9.7
GMM estimates of the production cost smoothing model

Statistic                  First step    Second step    Iterated    Continuous updating

$\hat\lambda_T$            0.295         0.330          0.332       0.256
s.e.($\hat\lambda_T$)      0.075         0.082          0.083       0.066
$\hat\phi_T$               -0.205        -0.238         -0.223      -0.933
s.e.($\hat\phi_T$)         0.586         0.557          0.554       0.801
$\hat\rho_T$               0.931         0.925          0.925       0.927
s.e.($\hat\rho_T$)         0.022         0.022          0.022       0.024
$J_T$                      .             3.671          4.136       3.629
p-value                    .             0.160          0.126       0.163

Notes: s.e.(.) denotes the standard error of the estimator calculated via (3.59), $J_T$ denotes
the overidentifying restrictions test and p-value its associated p-value.
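
The p-values in Table 9.7 can be checked by hand. With $q = 5$ moment conditions and $p = 3$ parameters the overidentifying restrictions statistic has two degrees of freedom, and the $\chi^2_2$ upper tail probability has the closed form $\exp(-x/2)$. The sketch below uses only that identity; the function name is chosen for the example.

```python
import math

def chi2_upper_tail_2df(x):
    """P(X > x) for a chi-squared variable with two degrees of freedom."""
    return math.exp(-x / 2.0)

# J statistics copied from Table 9.7.
j_stats = {"second step": 3.671, "iterated": 4.136, "continuous updating": 3.629}
p_values = {name: round(chi2_upper_tail_2df(j), 3) for name, j in j_stats.items()}
```

The resulting values agree with the p-value row of the table to three decimal places.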


9.4 Stochastic Volatility Model of Exchange 
Rates 


Section 1.3.5 describes the stochastic volatility model that has been used in a
number of studies of financial time series. In this section, we follow Melino and
Turnbull (1990) and investigate whether this model can capture the time series
properties of the daily U.S. dollar-Canadian dollar exchange rate.


Melino and Turnbull (1990) base their estimation on the following moment 
conditions: 
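$$
\begin{aligned}
E[w_t(\theta_0)] &= 0 \\
E[w_t^2(\theta_0)] - \exp[2\mu_x + 2\sigma_x^2] &= 0 \\
E[w_t^3(\theta_0)] &= 0 \\
E[w_t^4(\theta_0)] - 3\exp[4\mu_x + 8\sigma_x^2] &= 0 \\
E[|w_t(\theta_0)|] - (2/\pi)^{1/2}\exp[\mu_x + 0.5\sigma_x^2] &= 0 \\
E[|w_t(\theta_0)|^3] - 2(2/\pi)^{1/2}\exp[3\mu_x + 4.5\sigma_x^2] &= 0 \qquad (9.17) \\
E[|w_t(\theta_0)|w_t(\theta_0)] &= 0 \\
E[w_t(\theta_0)w_{t-j}(\theta_0)] &= 0 \\
E[|w_t(\theta_0)w_{t-j}(\theta_0)|] - \ell_{1,j}(\theta_0) + \ell_{2,j}(\theta_0) &= 0 \\
E[|w_t(\theta_0)|w_{t-j}(\theta_0)] - m_j(\theta_0) &= 0 \\
E[w_t^2(\theta_0)w_{t-j}^2(\theta_0)] - n_j(\theta_0) &= 0
\end{aligned}
$$

for $j = 1, 2, \ldots, 10$ where

$$w_t(\theta_0) = \frac{y(\tau_t) - \alpha_0 d - (1 + \beta_0 d)\,y(\tau_{t-1})}{d^{1/2}\,y(\tau_{t-1})^{\gamma_0/2}} \qquad (9.18)$$

and

$$
\begin{aligned}
\ell_{1,j}(\theta_0) &= (2/\pi)\exp[2\mu_x + \sigma_x^2(1 + (1 + \eta_0 d)^j) - 0.5\rho_0^2\zeta_0^2 d(1 + \eta_0 d)^{2(j-1)}] \\
\ell_{2,j}(\theta_0) &= (2/\pi)^{1/2}\rho_0\zeta_0 d^{1/2}(1 + \eta_0 d)^{j-1}\left(1 - 2\Phi(\rho_0\zeta_0 d^{1/2}(1 + \eta_0 d)^{j-1})\right) \\
&\quad \times \exp[2\mu_x + \sigma_x^2(1 + (1 + \eta_0 d)^j)] \\
m_j(\theta_0) &= (2/\pi)^{1/2}\rho_0\zeta_0 d^{1/2}(1 + \eta_0 d)^{j-1}\exp[2\mu_x + \sigma_x^2(1 + (1 + \eta_0 d)^j)] \\
n_j(\theta_0) &= \{4\rho_0^2\zeta_0^2 d(1 + \eta_0 d)^{2(j-1)} + 1\}\exp[4\mu_x + 4\sigma_x^2(1 + (1 + \eta_0 d)^j)] \\
\mu_x &= -\delta_0/\eta_0 \\
\sigma_x^2 &= \zeta_0^2 d/[1 - (1 + \eta_0 d)^2]
\end{aligned}
$$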

and $\Phi(\cdot)$ denotes the cumulative distribution function of a standard normal
random variable. To simplify the estimations, Melino and Turnbull (1990) fix
the value of $\gamma_0$ a priori and so the parameter vector is $\theta_0 = (\alpha_0, \beta_0, \delta_0, \eta_0, \zeta_0, \rho_0)'$.
Melino and Turnbull (1990) try three different values for $\gamma_0$, namely zero, one
and two, and find the results are relatively insensitive to the choice. Throughout
this section, we set $\gamma_0 = 1$ a priori.
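
To make the structure of these moment functions concrete, the following sketch evaluates $\mu_x$, $\sigma_x^2$, $\ell_{1,j}$, $\ell_{2,j}$, $m_j$ and $n_j$ for given parameter values. It is an illustration only: the function names are hypothetical, the formulas are coded as read from the display following (9.18), and the parameter values are the starting values quoted later in this section rather than new estimates.

```python
import math

def std_normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sv_moment_functions(delta, eta, zeta, rho, j, d=1.0):
    """Evaluate the volatility-related moment functions at lag j."""
    phi = 1.0 + eta * d                      # autoregressive coefficient of ln x
    if abs(phi) >= 1.0:
        raise ValueError("ln x must be stationary: need |1 + eta*d| < 1")
    mu_x = -delta / eta
    sig2_x = zeta ** 2 * d / (1.0 - phi ** 2)
    a = rho * zeta * math.sqrt(d) * phi ** (j - 1)
    base = math.exp(2.0 * mu_x + sig2_x * (1.0 + phi ** j))
    return {
        "mu_x": mu_x,
        "sig2_x": sig2_x,
        "l1": (2.0 / math.pi) * base * math.exp(-0.5 * a ** 2),
        "l2": math.sqrt(2.0 / math.pi) * a * (1.0 - 2.0 * std_normal_cdf(a)) * base,
        "m": math.sqrt(2.0 / math.pi) * a * base,
        "n": (4.0 * a ** 2 + 1.0)
             * math.exp(4.0 * mu_x + 4.0 * sig2_x * (1.0 + phi ** j)),
    }

neg_rho = sv_moment_functions(-0.384, -0.091, 0.153, -0.110, j=1)
pos_rho = sv_moment_functions(-0.384, -0.091, 0.153, 0.110, j=1)
zero_rho = sv_moment_functions(-0.384, -0.091, 0.153, 0.0, j=1)
```

Comparing `neg_rho` with `pos_rho` shows that only `m` changes when the sign of $\rho$ is reversed, while `l1`, `l2` and `n` depend on $\rho$ only through its magnitude.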

The condition for local identification is not particularly instructive for this 
model, and so is omitted. However, inspection of (9.17) does reveal that not 
all the moment conditions have the potential to provide information about all 
the parameters. The following schematic summarizes the potential informa- 
tion content of the moment conditions with the latter being represented by the 
associated functions of the data: 

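$w_t^i(\theta_0)$, $|w_t^k(\theta_0)|$   $\rightarrow$   $\alpha_0, \beta_0, \delta_0, \eta_0, \zeta_0$

$|w_t(\theta_0)w_{t-j}(\theta_0)|$, $w_t^2(\theta_0)w_{t-j}^2(\theta_0)$   $\rightarrow$   $\alpha_0, \beta_0, \delta_0, \eta_0, \zeta_0, |\rho_0|$

$|w_t(\theta_0)|w_{t-j}(\theta_0)$   $\rightarrow$   $\alpha_0, \beta_0, \delta_0, \eta_0, \zeta_0, \rho_0$


Clearly all of these moment conditions involve $(\alpha_0, \beta_0)$ and the majority involve
$\delta_0, \eta_0, \zeta_0$. However, it is only the moment conditions involving $|w_t(\theta_0)w_{t-j}(\theta_0)|$,
$w_t^2(\theta_0)w_{t-j}^2(\theta_0)$ and $|w_t(\theta_0)|w_{t-j}(\theta_0)$, $j = 1, 2, \ldots$ that involve $\rho_0$, and of these, only
the latter can reveal the sign of $\rho_0$.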

Inspection of (9.17) also reveals that the set of moment conditions involves
the expectations of absolute values of functions of $w_t(\theta_0)$. Such functions are
non-differentiable at zero. This scenario is outside the theoretical framework
developed earlier in the book because the asymptotic theory is predicated on
Assumption 3.5 that states $\partial f(v_t, \theta)/\partial\theta'$ exists for all $v_t \in \mathcal{V}$. Melino and Turnbull
(1990) argue that since the necessary derivatives exist almost everywhere
then it is reasonable to anticipate that the same asymptotic theory goes through
with appropriate modification of the conditions. This argument is valid but its
formal proof requires empirical process methods, and is not pursued here.^20
While this argument can be used to extend the asymptotic theory to cover this
type of model, there still remains the secondary issue of what value to assign
the derivative of $|w_t(\theta)|$, say, should $w_t(\theta) = 0$ in the computations. In fact, this
is not a vacuous question since this does happen in this model with this data
due to floating point accuracy of computer calculations. Melino and Turnbull
(1990) report that they set this derivative to zero. One disadvantage of this
approach is that the derivative can change dramatically in response to a slight
perturbation of the data, and Vetzal (1992) finds that this can cause problems
for the numerical optimization routine.^21 Therefore, Vetzal (1992) proposes using
a sixth order polynomial to approximate the behaviour of the absolute value
function in the neighbourhood of zero and then basing the derivative on the
approximating polynomial. This approximation works as follows. The derivative
of $|w|$ is replaced over the range $w \in [-\epsilon, \epsilon]$ by the derivative of


$$p(w) = a_0 + a_1 w + a_2 w^2 + a_3 w^3 + a_4 w^4 + a_5 w^5 + a_6 w^6 \qquad (9.19)$$


where the weights, $\{a_i\}$, are chosen to ensure that $p(w)$ mimics the behaviour
of $|w|$ at $w = 0$ and the boundaries of the neighbourhood. Specifically,
the constraints are: $p(0) = 0$, $p(\epsilon) = \epsilon$, $p'(\epsilon) = 1$, $p''(\epsilon) = 0$, $p(-\epsilon) = \epsilon$,
$p'(-\epsilon) = -1$, $p''(-\epsilon) = 0$, where $p'(w) = \partial p(w)/\partial w$ and $p''(w) = \partial^2 p(w)/\partial w^2$.
Vetzal (1992) shows that these constraints imply $a_i = 0$ for $i = 0, 1, 3, 5$,


20 However, see Andrews (1994) or Newey and McFadden (1994) [Section 7]. 
21 For example, if $w_t(\theta)$ is $-10^{-10}$ then the derivative of $|w_t(\theta)|$ is $-1$ but if $w_t(\theta)$ is $10^{-10}$
then the derivative of $|w_t(\theta)|$ is 1.
$a_2 = 15/(8\epsilon)$, $a_4 = -5/(4\epsilon^3)$ and $a_6 = 3/(8\epsilon^5)$. The advantage of this approach
is that the derivative makes a smooth transition from $-1$ to 1 in the neighbourhood
of zero. The disadvantage is that the derivative is actually incorrect for
$w \in (-\epsilon, 0) \cup (0, \epsilon)$. This approximation is employed in the calculation of the
derivatives of $|w_t(\theta)|$, $|w_t(\theta)w_{t-j}(\theta)|$, and $|w_t(\theta)|w_{t-s}(\theta)$ for $s = 0, 1, \ldots, 10$. Vetzal
(1992) observes that $|w_t^3(\theta)|$ resembles a parabola, and is quite flat around zero.
Consequently there is no need to use the approximation in this case and so the
derivative of $|w_t^3(\theta)|$ is only modified by setting it to zero at $w_t(\theta) = 0$.^22
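
The polynomial smoothing of the derivative of $|w|$ can be sketched as follows. The function names are hypothetical; the coefficients are those implied by the seven constraints on $p(w)$, namely $a_2 = 15/(8\epsilon)$, $a_4 = -5/(4\epsilon^3)$, $a_6 = 3/(8\epsilon^5)$ with all other weights zero.

```python
def poly_coefficients(eps):
    """Nonzero weights (a2, a4, a6) of the sixth order approximating polynomial."""
    return 15.0 / (8.0 * eps), -5.0 / (4.0 * eps ** 3), 3.0 / (8.0 * eps ** 5)

def p(w, eps):
    """Approximating polynomial p(w) on [-eps, eps]."""
    a2, a4, a6 = poly_coefficients(eps)
    return a2 * w ** 2 + a4 * w ** 4 + a6 * w ** 6

def abs_derivative(w, eps):
    """Derivative of |w|, smoothed on [-eps, eps]; eps = 0 gives sign(w) with 0 at w = 0."""
    if eps > 0.0 and abs(w) <= eps:
        a2, a4, a6 = poly_coefficients(eps)
        return 2.0 * a2 * w + 4.0 * a4 * w ** 3 + 6.0 * a6 * w ** 5
    if w == 0.0:
        return 0.0
    return 1.0 if w > 0.0 else -1.0

eps = 1e-2
value_at_boundary = p(eps, eps)              # should equal eps
slope_at_boundary = abs_derivative(eps, eps) # should equal 1
slope_at_zero = abs_derivative(0.0, eps)     # should equal 0
```

At the boundaries the smoothed derivative joins the exact values $\pm 1$, and at zero it passes through zero, so a small perturbation of $w$ around zero changes the derivative only slightly.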


We now turn to the estimation of the model. The data set consists of the
daily spot Canada-U.S. exchange rate for the period January 2, 1975 to December
10, 1986, and is identical to the one used in Melino and Turnbull's (1990)
study.^23 This gives a total of 3,010 observations on $w_t(\theta_0)$. For this data, the
minimum time interval, $d$, is one. With regard to the specifics of the GMM
estimation, all results reported in this section are calculated in the following way.
An iterated GMM estimation is performed with the maximum number of iterations
set at $I_{max} = 11$ and a convergence criterion set at $\epsilon_\theta = 10^{-6}$.^24 In practice,
convergence rarely occurred before this ceiling was met. Our experience was
that convergence was not uniform, and some experimentation with larger values
for $I_{max}$ did not yield obvious improvement. We attribute this to the highly
nonlinear nature of the moment conditions as a function of $\theta_0$. The first step
weighting matrix is set equal to $10^{8}$ times the identity matrix, and the long run
variance is estimated using the prewhitened and recoloured HAC matrix with the
bandwidth selected by Newey and West's (1994) data-based method and a Bartlett
kernel, that is, the estimator in (3.58). The starting values are the estimates reported
by Melino and Turnbull (1990), that is $\theta(0) = (0.042, -0.00054, -0.384, -0.091, 0.153, -0.110)$.^25


We begin our empirical analysis of this model by considering the sensitivity
of the results to the treatment of the derivative of the absolute value functions.
Table 9.8 contains estimation results using four different neighbourhoods of the
approximation: $[-\epsilon, \epsilon]$ with $\epsilon = 10^{-2}, 10^{-4}, 10^{-6}, 0$. In the first three cases,
the derivative is based on (9.19) in the way described above. For $\epsilon = 0$, the
neighbourhood collapses to the point zero and in this case the derivative of
$|w_t(\theta)|$ is only modified by setting it to zero at $w_t(\theta) = 0$. It can be seen
that both the estimates and standard errors exhibit some sensitivity to the
approximation used in the calculation of the derivative. However, there are

22 I am extremely grateful to Ken Vetzal for both drawing the issue discussed in this 
paragraph to my attention, and also for providing me with the material upon which the 
discussion is based. 

23 I am extremely grateful to Angelo Melino for providing me with both the data and also 
an unpublished appendix to their paper prepared by Ken Vetzal that derived the gradients of 
the moment conditions. 

24 See Section 3.6 

25 It should be noted that the specifics of our estimation differ from those in Melino and 
Turnbull (1990) in three respects: (i) they use an HAC estimator with $b_T = 50$; (ii) the first
step weighting matrix is Se evaluated at a Method of Moments estimator based on a set 
of six undisclosed moments; (iii) the iterated estimator is iterated an unspecified number of 
steps. 
clearly greater differences between the two step and iterated estimators for a
given value of $\epsilon$.


Table 9.8
Sensitivity of GMM estimates to derivative calculation
in stochastic volatility model

                   $\epsilon = 0$      $\epsilon = 10^{-6}$   $\epsilon = 10^{-4}$   $\epsilon = 10^{-2}$
                   2-st.    iter.      2-st.    iter.         2-st.    iter.         2-st.    iter.

$\hat\alpha_T$     0.051    0.096      0.012    0.089         0.044    0.061         0.069    0.090
s.e.               0.090    0.060      0.100    0.067         0.097    0.083         0.072    0.061
$\hat\beta_T$      0.001    -0.001     -0.000   -0.001        -0.001   -0.001        -0.000   -0.001
s.e.               0.001    0.001      0.001    0.001         0.001    0.001         0.003    0.001
$\hat\delta_T$     0.334    -0.234     -0.356   -0.258        -0.368   -0.249        -0.315   -0.291
s.e.               0.050    0.106      0.043    0.107         0.042    0.061         0.062    0.138
$\hat\eta_T$       0.079    -0.056     -0.085   -0.061        -0.088   -0.059        -0.075   -0.069
s.e.               0.012    0.025      0.010    0.025         0.010    0.014         0.018    0.033
$\hat\zeta_T$      0.163    0.114      0.172    0.121         0.176    0.123         0.156    0.127
s.e.               0.015    0.029      0.013    0.028         0.012    0.018         0.019    0.032
$\hat\rho_T$       0.279    -0.179     -0.321   -0.192        -0.324   -0.310        -0.216   -0.138
s.e.               0.356    0.696      0.320    0.632         0.299    0.672         0.375    0.469

$J_T$              41.00    38.52      42.57    38.50         42.28    39.71         41.26    38.39
p-value            0.47     0.58       0.40     0.58          0.42     0.53          0.46     0.59

Notes: Estimation is based on (9.17) for $j = 1, 2, \ldots, 10$. 2-st. and iter. denote the two step
and iterated estimators respectively. s.e. denotes the standard error of the estimator on the
line above. $\epsilon$ indexes the width of the neighbourhood of approximation for the absolute value
function; see text.


Regardless of the permutation chosen, the results are qualitatively
the same: the overidentifying restrictions test is insignificant at the 10%
level; the estimated parameters of the volatility process are individually
significantly different from zero at the 5% level, although the remaining estimates are
insignificant. These findings are also reported by Melino and Turnbull (1990).^26
Since the overidentifying restrictions test suggests the model is consistent
with the data, we now consider the implications for the nature of the conditional
variation of the exchange rate. In this context, two questions naturally arise.
Is the volatility stochastic? If so, for how long does the impact of
shocks to the volatility process last? We now consider these issues in turn.


26 The most striking difference between our results and those of Melino and Turnbull (1990)
is in the estimate and standard error of $\hat\delta_T$. Vetzal (1997) estimates a stochastic volatility
model for short term interest rates using different choices of covariance matrix estimator
and finds that the standard errors can be very sensitive to the choice of covariance matrix
estimator.
Within this model, volatility is stochastic if $\zeta_0 \neq 0$. A test of $H_0: \zeta_0 = 0$
versus $H_1: \zeta_0 \neq 0$ can be performed using any of the statistics described in
Section 5.3. For simplicity, we use the Wald test in (5.43) which for this simple
case reduces to

$$W_T = \left[\frac{\hat\zeta_T}{\mathrm{s.e.}(\hat\zeta_T)}\right]^2$$

Under the null, it follows from Theorem 5.7 that $W_T \xrightarrow{d} \chi^2_1$. Inspection of
Table 9.8 reveals that this hypothesis is overwhelmingly rejected in every case.
Therefore, volatility does indeed appear to be stochastic. The impact of the
stochastic shocks on volatility is governed by the data generation process for
the latent variable $x(\tau_t)$. This process is

$$\ln[x(\tau_t)] = \delta_0 d + (1 + \eta_0 d)\ln[x(\tau_t - d)] + \zeta_0 d^{1/2} u(\tau_t) \qquad (9.20)$$


One way to assess this impact is to consider the impulse response function.
Within this approach, the impact of a single shock on current and future values
is calculated under the assumption that there are no future shocks, that is
assuming all subsequent values of $u(\tau_t)$ are zero.^27 So for the following calculations,
it is assumed that $u(\tau_{t_0}) = u$ and $u(\tau_t) = 0$ for all $t > t_0$. Recalling that
$d = 1$ for this example, it follows from (9.20) that the impact of this type of
shock on $\ln[x(\tau_t)]$ is given by $(1 + \eta_0)^{t - t_0}\zeta_0 u$, for $t > t_0$. A common summary
statistic for impulse response functions is the half-life. The half-life of the shock
is defined to be the interval of time it takes for the impact of the shock to be
halved, that is $t_h$ such that

$$0.5\,\zeta_0 u = (1 + \eta_0)^{t_h - t_0}\zeta_0 u \qquad (9.21)$$

Without loss of generality, we can set $t_0 = 0$ and solve (9.21) to obtain

$$t_h = \frac{\ln[0.5]}{\ln[1 + \eta_0]}$$

To illustrate the half-life, we focus attention on the case in which $\epsilon = 10^{-6}$. The
associated value for the iterated estimator $\hat\eta_T$ yields $t_h = 11.013$ and so a
half-life of approximately eleven days.
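
Both calculations in this passage are easy to reproduce approximately from the rounded entries of Table 9.8. The sketch below uses the iterated estimates in the second pair of columns, $\hat\zeta_T = 0.121$ with standard error 0.028 and $\hat\eta_T = -0.061$. The function names are hypothetical, and because the inputs are rounded the half-life agrees with the reported 11.013 only to about three decimal places.

```python
import math

def wald_statistic(estimate, std_error):
    """Wald statistic for a single zero restriction: (estimate / s.e.)^2."""
    return (estimate / std_error) ** 2

def chi2_upper_tail_1df(x):
    """P(X > x) for a chi-squared variable with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def half_life(eta):
    """Periods until a shock's impact on ln x is halved, with d = 1."""
    return math.log(0.5) / math.log(1.0 + eta)

w_stat = wald_statistic(0.121, 0.028)
w_pvalue = chi2_upper_tail_1df(w_stat)
t_half = half_life(-0.061)
```

The Wald statistic is far above the 5% critical value of 3.841 for a $\chi^2_1$ variable, in line with the rejection of $H_0: \zeta_0 = 0$ described above.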

Taken at face value, the evidence appears to suggest that the stochastic
volatility model can capture the time series movements of this exchange rate.
However, the inferences described above rest on asymptotic theory and the
simulation results reported by Andersen and Sørensen (1996) cast doubt on the
adequacy of this theory as an approximation to finite sample behaviour in these
types of model.^28 These simulation results also indicate that the quality of the
asymptotic approximation can be very sensitive to the choice of moments, and
furthermore that certain moment conditions may be redundant. Therefore, we


27 See Hamilton (1994) [Chapter 1] for further discussion. 
28 See Section 6.3. 
estimate the model using different subsets of the moment conditions and compare
the results using the moment selection criteria described in Section 7.3. To
this end, we divided the moment conditions into five groups as follows (where
once again the moments are represented by the associated functions of the data):

$$
\begin{aligned}
M1 &= \{w_t^i(\theta_0),\ i = 1, 2, 3, 4;\ |w_t^k(\theta_0)|,\ k = 1, 3\} \\
M2 &= \{w_t(\theta_0)w_{t-j}(\theta_0),\ j = 1, 2, \ldots, 10\} \\
M3 &= \{|w_t(\theta_0)w_{t-j}(\theta_0)|,\ j = 1, 2, \ldots, 10\} \qquad (9.22) \\
M4 &= \{|w_t(\theta_0)|w_{t-j}(\theta_0),\ j = 0, 1, \ldots, 10\} \\
M5 &= \{w_t^2(\theta_0)w_{t-j}^2(\theta_0),\ j = 1, 2, \ldots, 10\}
\end{aligned}
$$

The model is estimated using four combinations of groups: M1, M2, M3, M4;
M1, M3, M4, M5; M1, M2, M4, M5; M2, M3, M4, M5.^29 Table 9.9 reports
the values for $MSC(c)$ and $RMSC(c)$ for these four subset models and the
original model involving all five groups.^30 Both criteria are calculated using
the penalty term associated with BIC given in (7.35). All statistics are calculated
using the iterated estimators based on the derivative approximation
with $\epsilon = 10^{-6}$. It can be recalled that the selected moment condition is the
one that minimizes the criterion in question. Therefore, the use of $MSC(c)$
selects all the moments in (9.17), but the use of $RMSC(c)$ leads to the choice
of the subset M1, M3 - M5. The latter suggests that the moment conditions
$E[w_t(\theta_0)w_{t-j}(\theta_0)] = 0$, $j = 1, 2, \ldots, 10$ are redundant given the moment conditions
in M1, M3 - M5. Table 9.10 reports the iterated estimation results for
this subset model. Interestingly, once the moments in M2 are omitted, $\hat\alpha_T$ is
roughly two to three times larger than the corresponding estimate based on the
full set of moments, and the estimates of both the exchange rate and volatility
equations are now individually significant at the 5% level.
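
The MSC values in Table 9.9 can be approximately reproduced from the reported $J$ statistics. The sketch below rests on two assumptions that are not stated in this section: that the criterion in (7.32) with the BIC penalty takes the form $J_T(c) - (|c| - p)\ln T$, where $|c|$ is the number of moment conditions, and that the effective sample size is $T = 3000$ (the 3,010 observations less ten lags). The function name `msc_bic` is hypothetical.

```python
import math

def msc_bic(j_stat, n_moments, n_params, sample_size):
    """Assumed form of the criterion: J - (q - p) * ln(T)."""
    return j_stat - (n_moments - n_params) * math.log(sample_size)

T_EFFECTIVE = 3000   # assumption: 3,010 observations minus 10 lags
N_PARAMS = 6         # alpha, beta, delta, eta, zeta, rho

# J statistics taken from Table 9.8 (iterated, second pair of columns) and Table 9.10.
msc_all = msc_bic(38.50, 47, N_PARAMS, T_EFFECTIVE)      # groups M1 to M5
msc_subset = msc_bic(19.81, 37, N_PARAMS, T_EFFECTIVE)   # groups M1, M3 to M5
```

Under these assumptions the two values round to -289.76 and -228.39, matching the corresponding MSC entries of Table 9.9. The relevant criterion RMSC is not sketched because it requires the estimated covariance matrix, which is not reported.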


While it is useful to characterize the conditional variation of financial series, 
this is not an end in itself. Melino and Turnbull (1990) use their model to 
price currency options, but we do not pursue this issue here because it requires 
additional estimations. Parenthetically, we note that they find the stochastic 
volatility model performs better than a number of its competitors in this con- 
text. More generally, stochastic volatility models have been used to analyze a 
variety of financial time series. However, it should be noted that few of these 
studies employ the GMM approach described above. Following the simulation 
evidence reported by Andersen and Sørensen (1996), researchers have sought
alternative ways to estimate these models. One such method involves moment 
based estimation using simulation techniques, and this is discussed in Chapter 
10. 


29 Recall from our earlier discussion that M4 must be included to identify po. 
30 See Sections 7.3.1 and 7.3.2 respectively for discussion of MSC and RMSC. 
Table 9.9
Moment selection criteria for the stochastic volatility model

Moments              q      MSC(c)      RMSC(c)
M1 - M5              47     -289.76     -5.41
M2 - M5              41     -249.16     -2.36
M1, M3 - M5          37     -228.39     -7.70
M1, M2, M4, M5       37     -214.66     -6.08
M1 - M4              37     -213.02     -5.49

Notes: MSC(c) and RMSC(c) are the moment selection criteria in (7.32) and (7.41) respectively.


Table 9.10
Iterated GMM estimates for the stochastic volatility model
based on moments M1, M3 - M5

$\hat\alpha_T$   $\hat\beta_T$   $\hat\delta_T$   $\hat\eta_T$   $\hat\zeta_T$   $\hat\rho_T$   $J_T$
0.193            -0.002          -0.266           -0.064         0.127           -0.307         19.81
0.077            0.001           0.067            0.016          0.020           0.619          0.94

Notes: The numbers below the parameter estimates are the standard errors and the number
below $J_T$ is the p-value of the test.


10 


Related Methods of 
Estimation 


In this chapter, we briefly review two other methods for exploiting moment con- 
ditions in estimation. Section 10.1 describes simulation based estimators known 
as Simulated Method of Moments, Indirect Inference and Efficient Method of 
Moments. Section 10.2 describes the method of Empirical Likelihood. The 
purpose of this discussion is to provide the intuition behind the methods in 
question and to explain their connections to GMM. References are provided for 
those readers interested in a rigorous analysis of the statistical properties of 
these estimators. 


10.1 Simulation Based Estimation 


Advances in computer technology have facilitated a growing interest in the use 
of simulation methods to estimate the parameters of economic models based 
on the information in population moment conditions. This approach is feasi- 
ble in models where the data generation process is known apart from certain 
parameters, and so it is possible to generate artificial samples of data for dif- 
ferent values of this parameter vector. The parameter estimator is then the 
value in the parameter space for which moments from the artificial data match 
the corresponding moments in the observed sample. There are two main vari- 
ants of this approach with the difference depending on the choice of moments. 
If the moments are derived from the model of interest then this approach is 
known as Simulated Method of Moments or Method of Simulated Moments. If 
the moments are derived from some auxiliary model then the method is known 
as Indirect Inference or Efficient Method of Moments depending on the precise 
setting. In this section we provide a brief description of these methods and the 
asymptotic properties of the associated estimators. The focus here is on providing
an intuitive introduction to the methods and on relating them to GMM. The
interested reader is referred to Carrasco and Florens (2002) for a recent survey
and Gourieroux and Monfort (1996) for a more comprehensive treatment. 


10.1.1 Simulated Method of Moments 


The basic idea behind Simulated Method of Moments (SMM) is best under- 
stood by considering a simple example. To this end, we return to a modified 
version of the simple example used to introduce the Method of Moments in 
Section 1.2. Suppose that $\{v_t\}$ is a sequence of scalar random variables that 
are independently and identically distributed. As in this earlier discussion, it 
is assumed that $E[v_t] = \mu_0$ but here it is assumed that $Var[v_t] = 1$ and that 
the distribution is normal. Given this specification, it is possible to generate an 
artificial sample for any value of the mean $\mu$ via 


$$v_n(\mu) = \mu + e_n, \qquad n = 1,2,\ldots,N \tag{10.1}$$


where $\{e_n; n = 1,2,\ldots,N\}$ are random draws from the standard normal
distribution. While the artificial sample $\{v_n(\mu); n = 1,2,\ldots,N\}$ can be generated 
for any choice of $\mu$, there is only one such sample that comes from the same 
distribution as the data, namely the one for which $\mu = \mu_0$. In consequence, it 
follows that as both $T \to \infty$ and $N \to \infty$: 


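$$\begin{aligned}
T^{-1}\sum_{t=1}^{T} v_t - N^{-1}\sum_{n=1}^{N} v_n(\mu_0) &\xrightarrow{p} 0 \\
T^{-1}\sum_{t=1}^{T} v_t - N^{-1}\sum_{n=1}^{N} v_n(\mu) &\xrightarrow{p} \mu_0 - \mu \neq 0, \quad \text{for } \mu \neq \mu_0
\end{aligned} \tag{10.2}$$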


SMM exploits the properties in (10.2) in the natural way: the SMM estimator 
of $\mu_0$ is $\hat{\mu}_T$, the value that satisfies $m_T^{(1)}(\hat{\mu}_T) = 0$ where 

$$m_T^{(1)}(\mu) = T^{-1}\sum_{t=1}^{T} v_t - N^{-1}\sum_{n=1}^{N} v_n(\mu) \tag{10.3}$$

In the above example, it is possible to simulate data to match the sample 
moment because the parameter is just identified. SMM can also be applied in 
overidentified models in a similar fashion to GMM. Continuing the example, 
suppose now that it is desired to base the estimation of $\mu_0$ on the information 
in the first two moments. The resulting SMM estimator of $\mu_0$ is the value of $\mu$ 
that minimizes 

$$\begin{bmatrix} m_T^{(1)}(\mu) \\ m_T^{(2)}(\mu) \end{bmatrix}' W_T \begin{bmatrix} m_T^{(1)}(\mu) \\ m_T^{(2)}(\mu) \end{bmatrix} \tag{10.4}$$

where 

$$m_T^{(2)}(\mu) = T^{-1}\sum_{t=1}^{T} v_t^2 - N^{-1}\sum_{n=1}^{N} v_n^2(\mu) \tag{10.5}$$

and $W_T$ is a weighting matrix satisfying the conditions in Assumption 3.7. 
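
A hedged sketch of the overidentified case in (10.4)-(10.5), assuming NumPy and SciPy. The identity weighting matrix, the search interval, the seed and the name `smm_objective` are assumptions of this illustration, not choices made in the text; any positive definite $W_T$ could be substituted.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def smm_objective(mu, v, e, W):
    """Quadratic form in the first- and second-moment discrepancies."""
    sim = mu + e                                       # simulated sample for this mu
    m = np.array([np.mean(v) - np.mean(sim),           # first-moment discrepancy
                  np.mean(v**2) - np.mean(sim**2)])    # second-moment discrepancy
    return m @ W @ m

rng = np.random.default_rng(1)      # illustrative seed
mu0, T, k = 1.0, 400, 5             # hypothetical true mean and sample sizes
v = mu0 + rng.standard_normal(T)    # stand-in for the observed data
e = rng.standard_normal(k * T)      # the same draws are reused for every mu
W = np.eye(2)                       # identity weighting, assumed for simplicity

res = minimize_scalar(smm_objective, args=(v, e, W),
                      bounds=(-5.0, 5.0), method="bounded")
mu_hat = res.x
```

With two moments and one parameter the minimand is generally positive at the estimate, which is what makes a specification test based on its value possible.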
Both these SMM estimators can also be considered as special cases of GMM. 
To elicit this connection, we return to the case where the estimator is based on 
the first moment alone. Some additional notation and structure is also needed. 
First, it is necessary to make some assumption about the relative magnitudes 
of N and T. We follow the common strategy in the literature of assuming 
that $N = kT$ for some fixed positive integer $k$ satisfying $k \geq 1$. Notice that 
it is now possible to write the index of the generated random variable $v_n(\mu)$ as 
$n = k(t-1)+i$ where $i = 1,2,\ldots,k$ for each $t = 1,2,\ldots,T$. Secondly, using this 
re-indexing, we can now define 

$$f_t(\mu) = v_t - k^{-1}\sum_{i=1}^{k} v_{k(t-1)+i}(\mu) \tag{10.6}$$

Finally, define $g_T(\mu) = T^{-1}\sum_{t=1}^{T} f_t(\mu)$. With these definitions, it can be seen 
that $m_T^{(1)}(\hat{\mu}_T) = 0$ can be re-written as 

$$g_T(\hat{\mu}_T) = T^{-1}\sum_{t=1}^{T} f_t(\hat{\mu}_T) = 0$$

and so the SMM estimator is the GMM estimator based on the population 
moment condition $E[f_t(\mu_0)] = 0$. 
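
The re-indexing argument can be checked numerically. The sketch below, assuming NumPy and using small hypothetical sizes, arranges the $N = kT$ simulated values into $T$ blocks of length $k$ and confirms that the average of the block-wise functions in (10.6) coincides with the first-moment discrepancy in (10.3).

```python
import numpy as np

rng = np.random.default_rng(2)   # illustrative seed
T, k, mu = 6, 3, 0.5             # small hypothetical sizes and parameter value
v = rng.standard_normal(T)       # stand-in for the observed sample
e = rng.standard_normal(k * T)   # simulation draws, N = kT
sim = mu + e                     # simulated values indexed n = 1, ..., N

# Row t-1 of the reshaped array holds the simulated values with
# index n = k(t-1)+i for i = 1, ..., k.
blocks = sim.reshape(T, k)
f = v - blocks.mean(axis=1)      # f_t(mu) as in (10.6)
g_T = f.mean()                   # average of f_t(mu) over t

m_T = v.mean() - sim.mean()      # first-moment discrepancy as in (10.3)
```

The two quantities `g_T` and `m_T` agree up to floating point rounding for any value of `mu`.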

This interpretation means that the consistency and asymptotic normality 
of this SMM estimator can be deduced from Theorems 3.1 and 3.2.¹ So, if 
estimation is based on (10.6) then it follows from Theorem 3.2 that² 

$$T^{1/2}(\hat{\mu}_T - \mu_0) \xrightarrow{d} N(0, V_{SMM}) \tag{10.7}$$

where $V_{SMM} = \lim_{T\to\infty} Var[T^{-1/2}\sum_{t=1}^{T} f_t(\mu_0)]$. Assuming that the observed 
and generated samples are independent, it follows that 

$$V_{SMM} = \lim_{T\to\infty} Var\Big[T^{-1/2}\sum_{t=1}^{T} v_t\Big] + \lim_{T\to\infty} Var\Big[T^{-1/2}\sum_{t=1}^{T} k^{-1}\sum_{i=1}^{k} v_{k(t-1)+i}(\mu_0)\Big] \tag{10.8}$$

Since both $\{v_t\}$ and $\{v_{k(t-1)+i}(\mu_0)\}$ are i.i.d. normal with mean $\mu_0$ and variance 1, 
the first term in (10.8) equals 1 and the second equals $kT/(Tk^2) = k^{-1}$, so that 

$$V_{SMM} = 1 + k^{-1}$$

This variance bears a simple relationship to the asymptotic variance of the 
GMM (or MM) estimator based on $E[v_t] - \mu_0 = 0$. It follows directly from 
Theorem 3.2 and our assumptions about $v_t$ here that the asymptotic variance 
of this GMM estimator is $V_{GMM} = 1$ and so 

$$V_{SMM} = (1 + k^{-1})V_{GMM} \tag{10.9}$$

Therefore, the SMM estimator is asymptotically less efficient than the GMM
estimator that uses the information in the population moment condition directly. 
However, note that the relative inefficiency decreases with $k = N/T$ and is zero 
in the limit as $k \to \infty$. The relationship in (10.9) is generic and recurs in more 
complicated models. 


1 While true in this example, it is not generally true; see below. 
2 Note $p = q = 1$ and $\partial f_t(\mu)/\partial\mu = -1$. 
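
The variance ranking in (10.9) can be illustrated by a small Monte Carlo experiment. This is a sketch under stated assumptions: NumPy is used, the function name `smm_scaled_errors`, the sample size, the number of replications and the seed are all hypothetical, and the closed-form estimator for the normal location example is used in place of a numerical search.

```python
import numpy as np

def smm_scaled_errors(k, T=100, reps=4000, mu0=0.0, seed=0):
    """Return T^{1/2}(mu_hat - mu0) across Monte Carlo replications."""
    rng = np.random.default_rng(seed)
    v = mu0 + rng.standard_normal((reps, T))     # one observed sample per row
    e = rng.standard_normal((reps, k * T))       # simulation draws, N = kT
    mu_hat = v.mean(axis=1) - e.mean(axis=1)     # closed-form SMM estimate
    return np.sqrt(T) * (mu_hat - mu0)

for k in (1, 2, 10):
    var_hat = smm_scaled_errors(k).var()
    print(f"k={k:2d}  simulated variance={var_hat:.3f}  1 + 1/k={1 + 1 / k:.3f}")
```

The simulated variances should lie close to $1 + k^{-1}$ and approach the GMM value of 1 as $k$ grows, subject to Monte Carlo error.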

This efficiency ranking illustrates that there are typically no advantages to 
implementing SMM when conventional GMM is feasible. However, there are a 
number of circumstances in which conventional GMM is infeasible and SMM 
then becomes an attractive alternative. SMM was introduced into the econo- 
metric literature by McFadden (1989) in the context of discrete response models. 
Within this setting, estimation can be based on moment conditions involving 
the difference between the observed responses and the expected responses im- 
plied by the model. In some cases, the expected response may be difficult to 
express analytically or to compute via numerical integration, but may be readily 
obtained via simulation. SMM has been used to estimate a variety of micro- 
econometric models, such as a multinomial logit model of transportation choice 
(McFadden and Train, 2000) and a bargaining model with asymmetric information
for medical malpractice disputes (Sieg, 2000).³ The method has also 
been applied to estimate a number of macroeconomic models, such as the
consumption based asset pricing model (Heaton, 1995), a real business cycle model 
(Collard, Fève, Langot, and Perraudin, 2002) and exchange rates (Iannizzotto 
and Taylor, 1999).⁴ Since these macroeconomic examples are closer in spirit 
to the types of model considered in Section 1.3 and Chapter 9, we explore one 
of these examples in more detail to illustrate why GMM may be infeasible but 
SMM can be implemented. Of the macroeconomic models listed above, the 
natural choice for such treatment is the consumption based asset pricing model 
studied by Heaton (1995) because this is a variation of the model described in 
Section 1.3.1 that has been used as our running empirical example. 


Example: Heaton’s (1995) Consumption Based Asset Pricing Model 
Heaton (1995) studies a version of the consumption based asset pricing model 
in which the representative agent maximizes 


3 See Gourieroux and Monfort (1996) for other examples. 
4 See Carrasco and Florens (2002) and Gourieroux and Monfort (1996) for additional 
references. 
where $\delta_0$ is the discount factor, $\Omega_t$ is the information set at time t, 


$$s_t = \sum_{j=0}^{\infty} \alpha_j c_{t-j}$$


and $c_t$ is consumption expenditure in period t. Notice that the functional form of 
the utility function is the same as in Hansen and Singleton’s (1982) version of the 
model described in Section 1.3.1. The key difference is that the agent now derives 
utility in period t from a linear combination of current and past consumption 
expenditures, and so preferences are not time separable. This specification can 
be motivated by considering consumption to be a durable good and $s_t$ as the 
service flow in period t from current and past consumption expenditures. It 
can be recalled from Section 1.3.1 that the completion of the model requires 
assumptions about the investment opportunities available to the agent. For 
simplicity here, it is assumed that the agent can invest in period t in a single 
asset at price $p_t$ that matures in period t + 1 with payoff $r_{t+1}$. In this case, the 
Euler equation is: 


$$E\left[\delta_0\,\frac{\mathrm{muc}(t+1)}{\mathrm{muc}(t)}\,\frac{r_{t+1}}{p_t}\,\Big|\,\Omega_t\right] - 1 = 0 \tag{10.10}$$


where muc(t) denotes the marginal expected discounted lifetime utility of con- 
sumption, that is 


$$\mathrm{muc}(t) = E\left[\sum_{j=0}^{\infty} \delta_0^{j}\,\alpha_j\, s_{t+j}^{\gamma_0 - 1}\,\Big|\,\Omega_t\right]$$


Now consider the problem of how to estimate the model parameters based on 
the information in the Euler equation. As in Section 1.3.1, it is possible to use 
an iterated expectations argument to derive moment conditions based on the 
orthogonality of the function of the data in the Euler equations to any variable 
in the information set. While such moment conditions are the stepping stone 
to GMM estimation in the simpler version of the model in Section 1.3.1, such 
an estimation is infeasible here for the following two reasons: 


• The Euler condition depends on an infinite sum. In practice, for GMM 
estimation, this sum would need to be truncated at order m, say. However, 
Heaton (1995) wishes to test for the existence of long run habit formation 
for which the $\{\alpha_j\}$ must be allowed to decay slowly, and this in turn 
suggests that m needs to be large. Large values of m lead to a 
high order moving average structure in the Euler equation 
residual, and existing simulation evidence suggests that GMM estimation 
may be unreliable in these cases.⁵ 


• Heaton (1995) wishes to allow for the case in which the agent makes 
decisions weekly rather than monthly. However, the GMM approach in 
Section 1.3.1 only works if data are observed at the frequency at which 
decisions are made, and aggregate consumption data are unavailable at 
higher frequencies than monthly. 


5 See Sections 3.5 and 6.3. 


Heaton (1995) shows that it is possible to circumvent both problems if the 
estimation is performed using Simulated Method of Moments. To describe his 
approach, it is necessary to introduce the following notation and structure. 
Assume that there is only one asset, as in our empirical implementation of the 
consumption based asset pricing model. Let t denote the week and assume 
that every month consists of four weeks. Define $\check{c}_t$, $d_t$ and $p_t$ to be weekly 
consumption, weekly dividend payments and the price of the asset at the 
end of the week, so that the weekly asset return is $r_t = p_t + d_t$. 

Heaton (1995) assumes that $Y_t = [\ln(\check{c}_t/\check{c}_{t-1}), \ln(d_t/d_{t-1})]'$ is generated by 
a subset VAR(12) model with a normally distributed error process. The
parameters of this VAR are estimated via SMM and are chosen to ensure that the 
implied series for monthly consumption and dividend growth match the first 
moment and certain second moments of actual monthly consumption and
dividend growth data. Once the model for $Y_t$ is estimated, it is then possible to 
simulate values for $Y_t$. Given values for $Y_t$ and the parameters, it is possible to 
solve (10.10) numerically for $p_t$. Therefore, the parameters of the Euler
equation are estimated via SMM and are chosen to ensure that the implied series 
for monthly asset returns matches certain first and second moment properties 
of the actual monthly asset return data. 


Pakes and Pollard (1989) provide an asymptotic theory for the SMM es- 
timator in models where the data are i.i.d. but the function in the moment 
condition may be discontinuous, as would be the case, for instance, in discrete 
response models. Lee and Ingram (1991) and Duffie and Singleton (1993) pro- 
vide a comparable asymptotic theory for time series models. These authors 
provide conditions under which the SMM estimator is consistent and asymptot- 
ically normal. These conditions are different from those employed in Chapter 3 
for the corresponding GMM estimators and their precise nature depends on 
both the structure of the moment condition and also on the way the data are 
simulated. We therefore do not pursue this theory further here but refer the 
reader to the aforementioned sources. Lee and Ingram (1991) also show that 
in overidentified models, the SMM minimand can form the basis for a model 
specification test along the same lines as the overidentifying restrictions test 
within the GMM framework. 


10.1.2 Indirect Inference 


Simulated Method of Moments involves simulating data from a model and choos- 
ing the parameters to match moments implied by the same model. However, 
in some cases, it can be desirable to simulate data from one model to match 
moments associated with some other model. This type of estimation is known 
as Indirect Inference, a terminology introduced by Gourieroux, Monfort, and 
Renault (1993). It is common to refer to the model from which the simulations 
are generated as the simulator, and the “other” model from which the moments 
are constructed as the auxiliary model, and we adopt this terminology here. 
Indirect Inference is attractive in circumstances where it is possible to simulate 
data from the model of interest but the complexity of the model makes it im- 
possible to estimate the parameters by conventional approaches, such as GMM. 
To illustrate the approach, we revisit the problem of estimating the parameters 
of a stochastic volatility model. 


Example: Stochastic volatility model 
For simplicity, consider the following special case of the stochastic volatility 
model described in Section 1.3.5, 


$$y_t = x_t e_t \tag{10.11}$$
$$\ln(x_t) = \theta_{0,1} + \theta_{0,2}\ln(x_{t-1}) + \theta_{0,3}u_t \tag{10.12}$$


where $(e_t, u_t)' \sim IN(0, I_2)$. It can be recalled from Section 1.3.5 that the model 
completely specifies the distribution, but that the structure of the model renders 
Maximum Likelihood estimation infeasible. However, since the data generation 
process is known, it is possible to simulate data from the model for a given value 
of $\theta$. This opens the door to the possibility of a simulation based estimation. 
Let $\{y_n(\theta); n = 1,2,\ldots,N\}$ be a sample of simulated values for $y_t$ from
(10.11)-(10.12) given $\theta$, and once again we set $N = kT$ for some positive integer $k$. 

The key question is which moments to match. Gallant and Tauchen (1996) 
argue that the natural choice of moments is the score equations of a closely 
related model. Their argument applies more generally than our specific example 
and their reasoning is based on the following efficiency argument. First suppose 
that the auxiliary model encompasses the simulator; in this case the estimation 
is based on the true score equations and so it can be shown that the resulting 
estimators are asymptotically efficient provided $k \to \infty$. Now suppose that the 
auxiliary model does not encompass the simulator but is a good approximation 
in some sense; in this case the resulting estimator can be thought of as being 
“nearly” asymptotically efficient. For the stochastic volatility model, a natural 
choice of auxiliary model is an alternative model for conditional variation such 
as the autoregressive conditional heteroscedasticity (ARCH) model proposed by 
Engle (1982). The ARCH model of order d is given by 


$$y_t = h_t(\alpha_0)w_t$$

where 

$$h_t^2(\alpha) = \alpha_1 + \sum_{i=1}^{d} \alpha_{i+1}\, y_{t-i}^2$$

and $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_{d+1})'$. Under the assumption that $w_t \sim IN(0,1)$, the 
quasi-log likelihood function is tractable and the associated score equations are:⁶ 

$$\sum_{t=1}^{T} s[\hat{\alpha}_T; y_t] = 0 \tag{10.13}$$

where $\hat{\alpha}_T$ is the quasi maximum likelihood estimator of
$\alpha_0 = (\alpha_{0,1}, \alpha_{0,2}, \ldots, \alpha_{0,d+1})'$,⁷ and $s[\cdot]$ is given by 

$$s[\alpha; y_t] = \frac{z_t}{2h_t^2(\alpha)}\left(\frac{y_t^2}{h_t^2(\alpha)} - 1\right)$$

for $z_t = (1, y_{t-1}^2, y_{t-2}^2, \ldots, y_{t-d}^2)'$. 
The Indirect Inference estimator of $\theta_0$ is defined to be 

$$\hat{\theta}_T = \arg\min_{\theta\in\Theta} \left[N^{-1}\sum_{n=1}^{N} s[\hat{\alpha}_T; y_n(\theta)]\right]' W_T \left[N^{-1}\sum_{n=1}^{N} s[\hat{\alpha}_T; y_n(\theta)]\right]$$

where $W_T$ is a weighting matrix satisfying Assumption 3.7. For this estimation 
strategy to work, the auxiliary model must provide at least as many moment 
conditions as parameters to be estimated in the simulator. In this example, this 
restriction translates into the constraint that $d + 1 \geq 3$. Notice that if $d + 1 > 3$ 
then $\theta_0$ is overidentified by the moments in the auxiliary score, and this is why 
the minimand is a quadratic form. 
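
The two steps of this estimator can be sketched in code. The block below is an illustration under several assumptions rather than a definitive implementation: it reads (10.11) as $y_t = x_t e_t$, conditions the ARCH quasi-likelihood on the first $d$ observations, uses an identity weighting matrix and a derivative-free optimizer, and adds a stationarity guard on $\theta_2$. All function names, starting values, sample sizes and the seed are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def simulate_sv(theta, e, u):
    """Simulate y_n = x_n * e_n with ln(x_n) = th1 + th2*ln(x_{n-1}) + th3*u_n."""
    th1, th2, th3 = theta
    ln_x = np.empty(len(e))
    prev = th1 / (1.0 - th2)           # start at the stationary mean of ln(x)
    for n in range(len(e)):
        prev = th1 + th2 * prev + th3 * u[n]
        ln_x[n] = prev
    return np.exp(ln_x) * e

def arch_terms(alpha, y, d):
    """Return z_t, h_t^2(alpha) and y_t^2 for t = d+1, ..., len(y)."""
    y2, n = y ** 2, len(y)
    z = np.column_stack([np.ones(n - d)] + [y2[d - i:n - i] for i in range(1, d + 1)])
    return z, z @ alpha, y2[d:]

def arch_neg_quasi_loglik(alpha, y, d):
    """Negative Gaussian quasi-log likelihood of ARCH(d), constants dropped."""
    _, h2, y2 = arch_terms(alpha, y, d)
    return 0.5 * np.mean(np.log(h2) + y2 / h2)

def arch_mean_score(alpha, y, d):
    """Sample average of z_t (y_t^2 / h_t^2 - 1) / (2 h_t^2)."""
    z, h2, y2 = arch_terms(alpha, y, d)
    return (z * ((y2 / h2 - 1.0) / (2.0 * h2))[:, None]).mean(axis=0)

def ii_objective(theta, alpha_hat, e, u, d, W):
    """Quadratic form in the auxiliary score evaluated on simulated data."""
    if abs(theta[1]) >= 0.999:         # stationarity guard added for this sketch
        return 1e10
    with np.errstate(all="ignore"):
        m = arch_mean_score(alpha_hat, simulate_sv(theta, e, u), d)
    return float(m @ W @ m) if np.all(np.isfinite(m)) else 1e10

rng = np.random.default_rng(3)                 # illustrative seed
T, k, d = 300, 4, 3                            # d + 1 = 4 auxiliary moments
theta0 = np.array([-0.1, 0.9, 0.3])            # hypothetical "true" values
y_obs = simulate_sv(theta0, rng.standard_normal(T), rng.standard_normal(T))

# Step 1: quasi-ML estimate of the auxiliary ARCH(d) parameters from observed data.
start = np.r_[0.5 * np.var(y_obs), np.full(d, 0.1)]
alpha_hat = minimize(arch_neg_quasi_loglik, start, args=(y_obs, d),
                     method="L-BFGS-B", bounds=[(1e-6, None)] * (d + 1)).x

# Step 2: choose theta so the auxiliary score is close to zero on simulated data.
e, u = rng.standard_normal(k * T), rng.standard_normal(k * T)  # fixed across theta
W = np.eye(d + 1)
theta_start = np.array([0.0, 0.5, 0.5])
res = minimize(ii_objective, theta_start, args=(alpha_hat, e, u, d, W),
               method="Nelder-Mead", options={"maxiter": 300})
theta_hat = res.x
```

The simulation shocks `e` and `u` are drawn once and reused for every candidate value of the parameter vector, so that the objective does not jump between evaluations. With samples this small the resulting estimate may be far from the values used to generate the data; the block is meant to show the structure of the calculation only.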


Gourieroux, Monfort, and Renault (1993) establish the consistency and 
asymptotic normality of the Indirect Inference estimator. They also show that 
there are a number of alternative ways of setting up the Indirect Inference
minimand based on choosing the parameters of the simulator so that estimates from 
the auxiliary model based on the simulated data match the estimates from the 
auxiliary model based on the observed sample. However, since Gourieroux, 
Monfort, and Renault (1993) show these estimators are asymptotically
equivalent to those based on matching the first derivative of the auxiliary model 
minimand, we do not discuss these alternative approaches here. Gourieroux, 
Monfort, and Renault (1993) also provide a number of specification tests for 
models estimated by Indirect Inference. In particular, they show that in
overidentified models, the Indirect Inference minimand can form the basis for a model 
specification test along the same lines as the overidentifying restrictions test 
within the GMM framework. 

To conclude this sub-section, we return to the issue raised in the example
above of which moments to match. Given Gallant and Tauchen’s (1996) 
efficiency argument, it is clearly desirable that the auxiliary model provides 
the best possible approximation to the data generation process. Therefore 
Gallant and Tauchen (1996) recommend that the auxiliary model involve the 
specification that the probability density function of the data takes some flexible 
functional form capable of approximating a wide variety of distributions within 
the class of interest. For time series data such as the stochastic volatility
example above, Gallant and Tauchen (1996) propose using a member of the class 
of semi-nonparametric (SNP) densities proposed by Gallant and Nychka (1987) to 
generate the score in the auxiliary model. SNP densities consist of a lead term, 
such as the conditional density associated with the ARCH(q) model, multiplied 
by an expansion involving Hermite polynomials. Gallant and Nychka (1987) 
show that such a density can recover a wide class of distributions as the order 
of the expansion tends to infinity. Therefore, this type of Indirect Inference
estimator is often referred to as Efficient Method of Moments (EMM). Andersen, 
Chung, and Sørensen (1999) provide simulation evidence on the use of EMM 
estimators in the stochastic volatility model, and find the method performs far 
better than the type of GMM estimator implemented in Section 9.4. EMM 
has been used to estimate the parameters of a variety of models; see Carrasco 
and Florens (2002) for a summary. Examples of its use to estimate stochastic 
volatility models are reported in Andersen and Lund (1997) and Gallant, Hsieh, 
and Tauchen (1997) where the applications are to interest rates and stock price 
indices respectively. 


6 See inter alia Hamilton (1994) [p.661]. 
7 See Section 3.8.1. 


10.2 Empirical Likelihood 


The method of Empirical Likelihood was introduced into the statistical liter- 
ature by Owen (1988). In this and two subsequent articles in 1990 and 1991, 
Owen demonstrated that the method can be used to perform inference about as- 
pects of a distribution or the parameters of a linear regression model. Empirical 
Likelihood has subsequently been extended to cases where it is desired to im- 
pose the restriction that the distribution satisfies a nonlinear moment condition 
indexed by an unknown parameter vector. As a result, this type of estimator is 
attracting an increasing amount of interest in econometrics. In this section, 
we provide a brief introduction to the method and pay particular attention to 
highlighting its connections to GMM. More comprehensive treatments can be 
found in the recent survey article by Imbens (2002) or the recent monograph by 
Owen (2001). 

To introduce the Empirical Likelihood approach to estimation, consider the 
situation in which the researcher observes a random sample of T observations on 
an i.i.d. random variable, v, and wishes to estimate its distribution. In the ab- 
sence of any information about the form of the distribution function, the natural 
estimator is the empirical distribution function, that is the estimated probabil- 
ity of each sample value is 1/T. This approach to estimation can be expressed 
as the outcome of an optimization as follows. Let $v_t$ denote the $t^{th}$ outcome in 
the sample and $\pi_t$ denote the probability that $v = v_t$. To be valid probabilities, 
it must follow that $0 \leq \pi_t \leq 1$ and $\sum_{t=1}^{T}\pi_t = 1$. The joint probability
distribution function of the sample is then given by $\prod_{t=1}^{T}\pi_t$. If some parametric model 
had been assumed for $\pi_t$ then this joint probability distribution function could 
be treated as the likelihood for the data, and used as a basis for estimating the 
unknown parameters of the distribution. Suppose now that the same step is 
taken here even though there is no assumption about the form of the underlying
distribution function. In this case, the “likelihood”, $\prod_{t=1}^{T}\pi_t$, is treated as a 
function of the unknown probabilities $\{\pi_t\}$. This likelihood interpretation leads 
to the following method for estimating the probabilities: 

$$\hat{\pi} = \arg\max_{\pi\in\Pi} \prod_{t=1}^{T}\pi_t \quad \text{subject to} \quad \sum_{t=1}^{T}\pi_t = 1 \tag{10.14}$$

where $\pi = (\pi_1, \pi_2, \ldots, \pi_T)$ and $\Pi = [0,1]^T$. The resulting estimators can be 
shown to be $\hat{\pi}_t = 1/T$ for all t; in other words, the distribution function is 
estimated by the empirical distribution function. The function $\prod_{t=1}^{T}\pi_t$ is known 
as the empirical likelihood. 
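
The solution to (10.14) can be verified with a short Lagrangian argument. The derivation below is a sketch that works with the logarithm of the empirical likelihood, which has the same maximizer when all probabilities are positive; the multiplier symbol $\kappa$ is introduced here for illustration and does not appear in the text.

```latex
\begin{align*}
\mathcal{L}(\pi,\kappa) &= \sum_{t=1}^{T}\ln \pi_t - \kappa\Big(\sum_{t=1}^{T}\pi_t - 1\Big), \\
\frac{\partial \mathcal{L}}{\partial \pi_t} &= \frac{1}{\pi_t} - \kappa = 0
  \;\Longrightarrow\; \pi_t = \kappa^{-1} \quad \text{for all } t, \\
\sum_{t=1}^{T}\pi_t = 1 &\;\Longrightarrow\; T\kappa^{-1} = 1
  \;\Longrightarrow\; \hat{\pi}_t = 1/T .
\end{align*}
```

Since the objective is strictly concave in the probabilities, this stationary point is the unique maximizer.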

This approach can be extended to incorporate information about the mo- 
ments of the unknown distribution. Continuing the example, suppose now that 
it is known that the distribution satisfies E[v] = 0. The empirical distribution 
estimator for the probabilities does not ensure this restriction in the sample 


because in general 

$$\sum_{t=1}^{T}\hat{\pi}_t v_t = T^{-1}\sum_{t=1}^{T} v_t \neq 0$$

However, it is possible to modify the constraint set so that this first moment 
condition is imposed. Specifically, suppose now that the empirical log likelihood 
is maximized subject to the twin constraints that the probabilities sum to one 
and the sample first moment is zero, that is 

$$\tilde{\pi} = \arg\max_{\pi\in\Pi} \sum_{t=1}^{T}\ln[\pi_t] \quad \text{subject to} \quad \sum_{t=1}^{T}\pi_t = 1 \ \text{ and } \ \sum_{t=1}^{T}\pi_t v_t = 0$$

It can be shown that estimates of the probabilities are now $\tilde{\pi}_t = T^{-1}(1 + \lambda v_t)^{-1}$ 
where $\lambda$ is the Lagrange Multiplier, normalized by $T$, associated with the
constraint that $\sum_{t=1}^{T}\pi_t v_t = 0$. 
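
The restricted probabilities can be computed by solving a single nonlinear equation for the multiplier. The sketch below assumes NumPy and SciPy and uses the form in which each probability is $T^{-1}(1 + \lambda v_t)^{-1}$, so that the probabilities sum to one; the function name `el_probabilities`, the bracketing rule and the example data are hypothetical.

```python
import numpy as np
from scipy.optimize import brentq

def el_probabilities(v):
    """Empirical likelihood probabilities under the restriction sum_t pi_t v_t = 0."""
    v = np.asarray(v, dtype=float)
    T = len(v)
    if not (v.min() < 0.0 < v.max()):
        raise ValueError("zero must lie strictly inside the range of the data")
    # lam solves sum_t v_t / (1 + lam * v_t) = 0. Every 1 + lam * v_t must be
    # positive, which confines lam to the interval (-1/max(v), -1/min(v)).
    lo, hi = -1.0 / v.max(), -1.0 / v.min()
    eps = 1e-10 * (hi - lo)
    lam = brentq(lambda l: np.sum(v / (1.0 + l * v)), lo + eps, hi - eps)
    pi = 1.0 / (T * (1.0 + lam * v))
    return pi, lam

rng = np.random.default_rng(4)              # illustrative seed
v = 0.3 + rng.standard_normal(50)           # sample whose mean is not exactly zero
pi, lam = el_probabilities(v)
```

Observations that pull the sample mean away from zero receive less weight than $1/T$, and those on the other side receive more.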

This approach can also be extended to the types of population moment 
condition considered in the preceding chapters. Qin and Lawless (1994) de- 
rive the Empirical Likelihood estimators for the case in which it is desired to 
impose the moment restriction that the (q × 1) population moment condition 
$E[f(v, \theta_0)] = 0$ holds for some unknown (p × 1) parameter vector, $\theta_0$. In this 
case, it is necessary to estimate both the unknown probabilities and also $\theta_0$. 
The Empirical Likelihood estimators for this case are defined to be: 

$$(\tilde{\pi}, \tilde{\theta}) = \arg\max_{\pi\in\Pi,\,\theta\in\Theta} \sum_{t=1}^{T}\ln[\pi_t] \quad \text{subject to} \quad \sum_{t=1}^{T}\pi_t = 1 \ \text{ and } \ \sum_{t=1}^{T}\pi_t f(v_t, \theta) = 0 \tag{10.15}$$

Qin and Lawless (1994) show that if q = p then the resulting estimator of $\pi$ 
is $\tilde{\pi}_t = 1/T$ for all t, and $\tilde{\theta}$ is the Method of Moments estimator based on 
$E[f(v_t, \theta_0)] = 0$, and so $\tilde{\theta}$ is also the GMM estimator in this case. However, 
if q > p then the Empirical Likelihood and GMM estimators are different in 
general. 

In comparison to GMM, it can be seen that Empirical Likelihood offers a 
very different way to exploit the information in population moment conditions. 
However, it turns out that the two estimators have the same asymptotic prop- 
erties. Specifically, Qin and Lawless (1994) show that the Empirical Likelihood 
estimator is consistent for $\theta_0$ and converges to the same limiting distribution as 
the GMM estimator calculated using the optimal weighting matrix.⁸ 

Just as within the GMM framework, one hypothesis of particular interest is 
whether the data are compatible with the population moment condition. Imbens, 
Spady, and Johnson (1998) show that this hypothesis can be tested within the 
Empirical Likelihood framework using statistics derived from the Likelihood 
Ratio, Wald and Lagrange Multiplier testing principles. For brevity, we consider 
only the Likelihood Ratio type test here. This test compares the value of the 
empirical log likelihood function at the restricted estimates, $\tilde{\pi}$, with the value at 
the unrestricted estimates, $\hat{\pi}$. The statistic is 

$$LR\text{-}EL = 2\{ELLF_T(\hat{\pi}) - ELLF_T(\tilde{\pi})\}$$

where $ELLF_T(\pi) = \sum_{t=1}^{T}\ln[\pi_t]$. Under the null hypothesis that $E[f(v, \theta_0)] = 0$,
Imbens, Spady, and Johnson (1998) show that $LR\text{-}EL$ converges to a $\chi^2_{q-p}$ 
distribution. Notice this limiting distribution is exactly the same as the limiting 
distribution of the overidentifying restrictions test under the same null.⁹ 
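
For the simple zero-mean restriction, with one moment condition and no parameters to estimate, the statistic can be computed as follows. This is a sketch assuming NumPy and SciPy; the function name `lr_el_zero_mean`, the bracketing rule for the multiplier and the example data are hypothetical, and the reference distribution used is chi-squared with one degree of freedom.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

def lr_el_zero_mean(v):
    """Empirical likelihood ratio statistic and p-value for the null E[v] = 0."""
    v = np.asarray(v, dtype=float)
    T = len(v)
    # Solve for the multiplier of the restricted problem (zero must lie
    # strictly between the smallest and largest observation).
    lo, hi = -1.0 / v.max(), -1.0 / v.min()
    eps = 1e-10 * (hi - lo)
    lam = brentq(lambda l: np.sum(v / (1.0 + l * v)), lo + eps, hi - eps)
    pi_restricted = 1.0 / (T * (1.0 + lam * v))
    ellf_unrestricted = T * np.log(1.0 / T)          # probabilities all equal 1/T
    ellf_restricted = np.sum(np.log(pi_restricted))
    stat = 2.0 * (ellf_unrestricted - ellf_restricted)
    return stat, chi2.sf(stat, df=1)

rng = np.random.default_rng(5)                        # illustrative seed
v0 = rng.standard_normal(200)                         # data satisfying the null
stat0, p0 = lr_el_zero_mean(v0)
stat1, p1 = lr_el_zero_mean(v0 + 0.5)                 # data violating the null
```

Shifting the data away from zero forces the restricted probabilities further from $1/T$, which raises the statistic and lowers the p-value.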

In terms of these asymptotic properties, there is nothing to choose between 
Empirical Likelihood and GMM estimation. However, there is some recent evi- 
dence that the Empirical Likelihood estimators may exhibit better finite sample 
performance. Newey and Smith (2004) develop Nagar type approximations for 
the approximate bias of the Empirical Likelihood estimator along with those for 
the two step GMM and continuous updating estimators that are discussed in 
Section 6.2.2. They show that the approximate bias of the Empirical Likelihood 
estimator is equal to $T^{-1}B_I$, using the notation defined in Section 6.2.2.
Therefore the Empirical Likelihood estimator has fewer sources of bias than either the two step 
GMM or continuous updating GMM estimators. Given the existence of these 
types of bias, it is natural to consider using the bootstrap to provide more
accurate finite sample inference. Since the Empirical Likelihood approach generates 
probabilities for the data outcomes that are consistent with the moment
condition, it provides a very computationally convenient way of generating artificial 
data consistent with the estimated model. Brown and Newey (2002) present 
an empirical likelihood based method for bootstrapping with i.i.d. data and 
show that it is at least as efficient as the methods described in Section 8.1. 

All the studies mentioned above address the behaviour of the Empirical 
Likelihood estimator in the context of i.i.d. data. Kitamura (1997) shows that 
if the data are generated by a stationary dynamic process the Empirical
Likelihood estimator in (10.15) is no longer as asymptotically efficient as the two 
step GMM estimator. However, asymptotic equivalence can be restored if the 
sampling unit is taken to be blocks of observations along similar lines to the 
blocking schemes discussed in Section 8.1.2.1. The probabilities in the
empirical likelihood are then interpreted as the probability that a particular block is 
sampled. Kitamura (1997) provides conditions under which the resulting
Empirical Likelihood estimators have the same asymptotic distribution as the two 
step GMM estimator. 


8 See Sections 3.4 and 3.6. Also see Chamberlain (1987) and Section 7.2.3. 
9 See Theorem 5.1 in Section 5.1. 

There have been a number of variations of the Empirical Likelihood estimator 
proposed in the literature. These variations involve replacing the Empirical 
Likelihood by some other function of the probabilities. It is not our purpose here 
to provide a review of this literature and instead the interested reader is referred 
to Imbens (2002). However, one particular extension is worth noting. Smith 
(1997) introduces the class of Generalized Empirical Likelihood estimators, and 
Newey and Smith (2004) show that this class includes the continuous updating 
GMM estimator but not the two step GMM estimator. This difference is the 
source of the estimators' contrasting approximate bias properties discussed in 
Section 6.2.2. 


Appendix A 


Mixing Processes and 
Nonstationarity 


This appendix provides a heuristic introduction to mixing processes followed by 
a brief summary of existing results on GMM in a nonstationary environment. 


A.1 Mixing processes 


As mentioned in Section 3.4, if $v_t$ is a mixing process then the dependence 
between $v_t$ and $v_{t-m}$ disappears as $m \to \infty$. To make this definition operational, 
it is necessary to make precise the notion of “dependence”. Several approaches 
have been taken and each yields a different type of mixing process. In this 
appendix, we focus on so called strong or $\alpha$-mixing processes. The discussion in 
this appendix relies heavily on Davidson (1994) [Chapters 13 and 14] to which 
the reader is referred for a rigorous treatment of this material and definitions of 
other types of mixing processes. 

Although more accessible than ergodicity, the definition of an $\alpha$-mixing
process involves some sophisticated mathematical concepts. Below we build up to a 
formal definition in three steps. First, we introduce the measure of dependence 
in the context of two specific sets of elementary events. Secondly, this measure 
is extended to cover the dependence between two collections of sets. Finally, we 
show how this measure can be used to capture the dependence structure of a 
stochastic process. 

To begin, it is useful to recall from elementary probability theory that if two 
events G and H are independent then $P(G \cap H) = P(G)P(H)$. If G and H 
are dependent then, conversely, $P(G \cap H) - P(G)P(H) \neq 0$. 
These two basic properties suggest that 

$$\alpha'(G, H) = P(G \cap H) - P(G)P(H)$$

provides a reasonable starting place in our search for a measure of dependence 
between G and H. However, as it stands, $\alpha'(G, H)$ has one unattractive feature. 
The measure $\alpha'(G, H)$ makes a distinction between cases $\alpha'(G_1, H_1) = c$ and 
$\alpha'(G_2, H_2) = -c$ whereas intuition suggests these two cases are both the same 
“distance” from independence. In other words, it is preferable to capture the 
dependence between G and H using 

$$\alpha(G, H) = |P(G \cap H) - P(G)P(H)| \tag{A.1}$$


We now turn to the extension of this measure of dependence to collections of
sets. For conformity with what follows below, it is useful to give these collections
of sets certain properties.


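Definition A.1 $\sigma$-field

Let $\mathcal{F}$ be a collection of subsets of the set $\Omega$. Then $\mathcal{F}$ is a $\sigma$-field (or $\sigma$-algebra)
if it satisfies the following three conditions: (i) $\Omega \in \mathcal{F}$; (ii) if $A \in \mathcal{F}$ then
$A^c \in \mathcal{F}$; (iii) if $\{A_n, n = 1, 2, \ldots\}$ is a sequence of sets in $\mathcal{F}$ then $\bigcup_{n=1}^{\infty} A_n \in \mathcal{F}$.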
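
The three conditions of Definition A.1 can be checked mechanically when $\Omega$ is finite. The following is a minimal sketch under that assumption: the function name `is_sigma_field` and the example collections are hypothetical, and on a finite set closure under countable unions reduces to closure under pairwise unions, which is what the code tests.

```python
from itertools import combinations


def is_sigma_field(omega, collection):
    """Check the three conditions of Definition A.1 for a finite set omega.

    On a finite set, closure under countable unions reduces to closure
    under pairwise unions, which is what is checked here.
    """
    omega = frozenset(omega)
    sets = {frozenset(s) for s in collection}
    if omega not in sets:                          # (i) omega belongs to F
        return False
    if any(omega - a not in sets for a in sets):   # (ii) closed under complements
        return False
    # (iii) closed under unions
    return all(a | b in sets for a, b in combinations(sets, 2))


omega = {1, 2, 3, 4}
F_good = [set(), {1, 2}, {3, 4}, omega]   # generated by the partition {1,2}, {3,4}
F_bad = [set(), {1}, omega]               # the complement {2, 3, 4} is missing

print(is_sigma_field(omega, F_good))  # True
print(is_sigma_field(omega, F_bad))   # False
```

The first collection is the $\sigma$-field generated by a two-cell partition of $\Omega$; the second fails condition (ii).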


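Now define $\mathcal{G}$ and $\mathcal{H}$ to be two $\sigma$-fields, and let $G \in \mathcal{G}$ and $H \in \mathcal{H}$. As we have
seen, $G$ and $H$ are independent if $\alpha(G, H) = 0$. For $\mathcal{G}$ and $\mathcal{H}$ to be independent,
it must be the case that $\alpha(G, H) = 0$ for all $G \in \mathcal{G}$ and all $H \in \mathcal{H}$ or, more
compactly, $\alpha(\mathcal{G}, \mathcal{H}) = 0$ where

$$\alpha(\mathcal{G}, \mathcal{H}) = \sup_{G \in \mathcal{G},\, H \in \mathcal{H}} \alpha(G, H) \qquad \text{(A.2)}$$

Notice that if $\mathcal{G}$ and $\mathcal{H}$ are dependent then there must be some $G \in \mathcal{G}$ and
$H \in \mathcal{H}$ for which $\alpha(G, H) \neq 0$ and so $\alpha(\mathcal{G}, \mathcal{H}) \neq 0$. As the notation suggests,
$\alpha(\mathcal{G}, \mathcal{H})$ forms the basis for the measure of dependence in the definition of an
$\alpha$-mixing process.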
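
To make (A.2) concrete, consider the special case in which the two $\sigma$-fields are generated by two discrete random variables $X$ and $Y$ with finitely many values. This setting, the function name `alpha_sigma_fields` and the two joint distributions below are assumptions chosen for illustration. Every event in the $\sigma$-field generated by $X$ has the form $\{X \in A\}$, so the supremum becomes a maximum over finitely many pairs of events and can be found by enumeration.

```python
from itertools import chain, combinations


def subsets(values):
    """All subsets of a finite collection of values, as tuples."""
    values = list(values)
    return chain.from_iterable(
        combinations(values, r) for r in range(len(values) + 1)
    )


def alpha_sigma_fields(joint):
    """Evaluate (A.2) for the sigma-fields generated by two discrete variables.

    joint[(x, y)] is P(X = x, Y = y). The supremum is taken by brute force
    over all events {X in A} and {Y in B}.
    """
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    best = 0.0
    for A in subsets(xs):
        for B in subsets(ys):
            p_g = sum(p for (x, _), p in joint.items() if x in A)
            p_h = sum(p for (_, y), p in joint.items() if y in B)
            p_gh = sum(p for (x, y), p in joint.items() if x in A and y in B)
            best = max(best, abs(p_gh - p_g * p_h))   # alpha(G, H) from (A.1)
    return best


independent = {(x, y): 0.5 * 0.5 for x in (0, 1) for y in (0, 1)}
dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

print(alpha_sigma_fields(independent))  # 0.0
print(alpha_sigma_fields(dependent))    # approximately 0.15
```

In the dependent case the maximum is attained at, for example, $G = \{X = 0\}$ and $H = \{Y = 0\}$, where $|0.4 - 0.5 \times 0.5| = 0.15$.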

The last step towards the formal definition of an $\alpha$-mixing process involves
the adaptation of the measure of dependence to time series. At this stage, it is 
necessary to introduce certain concepts relating to stochastic processes. These 
concepts are stated, but not explained, because such an explanation is beyond 
the scope of this book. It is hoped that the previous discussion is sufficient to 
convey the intuition behind the definition of a mixing process. We refer the 
interested reader to Davidson (1994) [Chapters 12-14] for a rigorous treatment 
of stochastic processes, dependence and mixing processes. 

Consider the stochastic process $\{v_t(\omega)\}$ defined on the probability space
$(\Omega, \mathcal{F}, P)$. Let $\mathcal{F}_s^t$ be the smallest $\sigma$-field on which $(v_s, v_{s+1}, \ldots, v_t)$ is measurable.
Two particular $\sigma$-fields are of interest here: $\mathcal{F}_{-\infty}^{t}$, which can be thought of as
“the information contained in the sequence up to date $t$”; and $\mathcal{F}_{t+m}^{\infty}$, which can
be thought of as “the information contained in the sequence from $t+m$ onwards”.
From our previous discussion, we can capture the dependence between $\mathcal{F}_{-\infty}^{t}$ and
$\mathcal{F}_{t+m}^{\infty}$ by

$$\alpha_m = \alpha(\mathcal{F}_{-\infty}^{t}, \mathcal{F}_{t+m}^{\infty})$$

With this background, we can finally present the definition towards which we
have been working.


Definition A.2 $\alpha$-mixing process

The sequence $\{v_t\}_{t=-\infty}^{\infty}$ is said to be $\alpha$-mixing if $\lim_{m \to \infty} \alpha_m = 0$.

Therefore, an $\alpha$-mixing process is one in which the dependence between
observations $m$ periods apart, as measured by $\alpha_m$, decays to zero as $m \to \infty$. One implication of this definition
is that the autocovariances of $v_t$ exhibit a similar decay, that is¹

$$\mathrm{Cov}(v_t, v_{t-m}) \to 0 \text{ as } m \to \infty \qquad \text{(A.3)}$$


This particular property provides a useful glimpse into the difference between
mixing and ergodicity, because the latter implies²

$$M^{-1} \sum_{m=1}^{M} \mathrm{Cov}(v_t, v_{t-m}) \to 0 \text{ as } M \to \infty \qquad \text{(A.4)}$$


Therefore, mixing implies the autocovariances decay to zero as $m \to \infty$ but
ergodicity implies the average of the first $M$ autocovariances tends to zero as
$M \to \infty$. The former implies the latter but not vice versa. From a comparison
of (A.3) and (A.4), it can be seen that some generality is lost by moving from
ergodic to mixing processes. However, since (A.3) is plausible for many economic
series, it may be argued that not much is lost.
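
The gap between (A.3) and (A.4) can be seen numerically with two hypothetical autocovariance sequences, both chosen here purely for illustration. The first decays geometrically, so both conditions hold. The second is $\cos(\lambda m)$, which is the autocovariance sequence of $v_t = A\cos(\lambda t) + B\sin(\lambda t)$ with $A$ and $B$ uncorrelated, mean zero and of unit variance; it never decays, yet its running average still tends to zero. The sketch concerns only the arithmetic of the two conditions and makes no claim about the ergodicity of any particular process.

```python
import math


def cesaro_average(gamma, M):
    """M^{-1} * sum_{m=1}^{M} gamma(m), the left-hand side of (A.4)."""
    return sum(gamma(m) for m in range(1, M + 1)) / M


def gamma_decaying(m, rho=0.8):
    """Hypothetical autocovariances that decay geometrically, as in (A.3)."""
    return rho ** m


def gamma_oscillating(m, lam=1.0):
    """Hypothetical autocovariances cos(lam * m): bounded but not decaying."""
    return math.cos(lam * m)


for M in (10, 100, 10_000):
    print(M, cesaro_average(gamma_decaying, M), cesaro_average(gamma_oscillating, M))
```

Both averages shrink as $M$ grows, but only the first sequence satisfies (A.3).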


For the asymptotic analysis in Chapter 3 it is not sufficient for the dependence
simply to decay to zero; it must do so at a particular rate. The rate of
decay has been captured in the literature using the concept of size of a mixing
process.


Definition A.3 Size of a mixing process
$v_t$ is an $\alpha$-mixing process of size $-c_0$ if $\alpha_m = O(m^{-c})$ for some $c > c_0$.
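
As a rough illustration of the rate condition, consider a stationary two-state Markov chain, an example that is assumed here and not taken from the text. The code computes the dependence measure (A.2) between the $\sigma$-fields generated by $v_t$ alone and by $v_{t+m}$ alone. These are contained in $\mathcal{F}_{-\infty}^{t}$ and $\mathcal{F}_{t+m}^{\infty}$ respectively, so the quantity computed is no larger than $\alpha_m$ and is used only to show what a rate of decay looks like. The names `pairwise_alpha`, `p`, `q` and `c` are hypothetical.

```python
def mat_mul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def pairwise_alpha(p, q, m):
    """alpha(sigma(v_t), sigma(v_{t+m})) for a stationary two-state chain.

    p = P(0 -> 1) and q = P(1 -> 0). Each sigma-field contains only four
    events, and the measure in (A.1) takes the same value for every pair of
    non-trivial events, so it is enough to use G = {v_t = 0} and
    H = {v_{t+m} = 0}.
    """
    pi0 = q / (p + q)                       # stationary P(v_t = 0)
    step = [[1 - p, p], [q, 1 - q]]
    power = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(m):
        power = mat_mul(power, step)        # m-step transition probabilities
    return abs(pi0 * power[0][0] - pi0 * pi0)


p, q, c = 0.2, 0.3, 5.0
for m in (1, 5, 10, 50, 100):
    a_m = pairwise_alpha(p, q, m)
    print(m, a_m, (m ** c) * a_m)           # m^c * a_m remains bounded
```

For these values the coefficient equals $0.24 \times 0.5^m$. Geometric decay of this kind is $O(m^{-c})$ for every $c$, which is why the product in the last column eventually falls towards zero instead of growing.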


This definition implies that the larger the size (that is, the closer $-c_0$ is to zero), the greater the dependence
allowed in the series. Obviously it is desirable to allow for as much dependence
as possible. However, the issue is not that simple, because dependence is just one
feature of the series which must be restricted to permit conventional asymptotic
analysis. It is also necessary to place restrictions on the existence of certain
moments, and hence implicitly on the tail behaviour of $v_t$. As an illustration,
consider the conditions imposed by Andrews (1991) to underpin his analysis of
the asymptotic properties of covariance matrix estimators which are discussed
in Section 3.5.3. He imposes the following two conditions: (i) $v_t$ is $\alpha$-mixing of
size $-3\nu/(\nu - 1)$; (ii) $E[\|v_t\|^{4\nu}] < \infty$. Inspection reveals that as $\nu$ increases the
size increases and so the degree of dependence allowed increases. However, at
the same time, an increase in $\nu$ increases the order up to which the moments of
$v_t$ must exist. The latter is implicitly a restriction on the tail behaviour of the
distribution of $v_t$.⁴ So there is a tension between these two aspects of $v_t$ in the
context of asymptotic analysis.

1 See Davidson (1994) [p.203].

2 See Davidson (1994) [p.201]. Note that this condition is sufficient but not necessary for
ergodicity unless $\{v_t\}$ is a Gaussian process in which case it is a necessary condition as well.

3 This includes the Weak Law of Large Numbers, Central Limit Theorem and also the
consistency of the covariance matrix estimators discussed in Section 3.5.
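
The tension in conditions (i) and (ii) can be tabulated directly. The sketch below assumes $\nu > 1$, which is needed for the size $-3\nu/(\nu - 1)$ to be negative; the function name `andrews_conditions` and the grid of values for $\nu$ are hypothetical.

```python
def andrews_conditions(nu):
    """Mixing size and moment order implied by conditions (i) and (ii)."""
    if nu <= 1:
        # Assumption of this sketch: nu > 1, so that the size is negative.
        raise ValueError("nu must exceed 1")
    size = -3 * nu / (nu - 1)       # condition (i): alpha-mixing of this size
    moment_order = 4 * nu           # condition (ii): E||v_t||^(4 nu) is finite
    return size, moment_order


for nu in (1.1, 1.5, 2.0, 4.0, 10.0):
    size, order = andrews_conditions(nu)
    print(f"nu = {nu:5.1f}   size = {size:8.3f}   moments up to order {order:5.1f}")
```

As $\nu$ rises the size moves up towards $-3$, so more dependence is tolerated, while the order of the moments that must exist grows without bound.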

In the course of the analysis in Chapter 3, it is necessary to apply limit theorems
to various functions of $v_t$. One particularly attractive feature of $\alpha$-mixing
processes is that certain functions of them are also $\alpha$-mixing. Specifically, if $v_t$
is $\alpha$-mixing of size $-c_0$ then $Y_t = g(v_t, v_{t-1}, \ldots, v_{t-\tau})$ is also $\alpha$-mixing
of size $-c_0$ provided $\tau$ is finite. The reason for this proviso is readily understood.
If $\tau$ is infinite, then both $Y_t$ and $Y_{t-m}$ are functions of $\{v_{t-n}, n \geq m\}$ regardless
of how large $m$ becomes, and so $Y_t$ cannot be an $\alpha$-mixing process in general.
However, it turns out that this situation can be circumvented by restricting $g(\cdot)$
to be near epoch dependent. Essentially, this condition restricts $g(\cdot)$ so that the
dependence in $Y_t$ decays sufficiently fast to allow the derivation of Laws of Large
Numbers and Central Limit Theorems. We do not pursue this topic here but
instead refer the interested reader to Davidson (1994) [Chapter 17].
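
The role of a finite $\tau$ is easiest to see in a special case that is assumed here for illustration: $v_t$ is i.i.d. with variance `sigma2` and $g$ is linear, so that $Y_t = \sum_{j=0}^{\tau} w_j v_{t-j}$. Once $m > \tau$, $Y_t$ and $Y_{t-m}$ share none of the underlying $v$'s and are therefore independent; the autocovariance computed below is one visible consequence. The function name and the weights are hypothetical.

```python
def filter_autocovariance(weights, m, sigma2=1.0):
    """Cov(Y_t, Y_{t-m}) for Y_t = sum_j weights[j] * v_{t-j}, v_t i.i.d.

    Only pairs of weights that load on the same v contribute, which gives
    sigma2 * sum_j weights[j] * weights[j + m].
    """
    overlap = len(weights) - m          # number of v's shared by Y_t and Y_{t-m}
    if overlap <= 0:
        return 0.0
    return sigma2 * sum(weights[j] * weights[j + m] for j in range(overlap))


tau = 3
weights = [1.0, 0.5, 0.25, 0.125]       # hypothetical g: tau + 1 linear weights

for m in range(6):
    print(m, filter_autocovariance(weights, m))
```

The autocovariance is exactly zero for every $m > \tau$. With infinitely many non-zero weights there would be no such cut-off, which is the situation that near epoch dependence is designed to handle.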


A.2 Nonstationarity 


If stationarity is relaxed then it becomes necessary to make an explicit assump- 
tion about the nature of the nonstationarity. Three approaches have been taken 
in the literature: (i) nonstationary mixing processes; (ii) deterministic trends; 
(iii) unit root processes. We now provide a brief summary of the available results 
in each case. 


• Mixing processes: The concept of a mixing process can be extended to
nonstationary processes by setting $\alpha_m = \sup_t \alpha(\mathcal{F}_{-\infty}^{t}, \mathcal{F}_{t+m}^{\infty})$. Subject
to certain restrictions, Weak Laws of Large Numbers and Central Limit
Theorems can be developed for nonstationary mixing processes; see Gallant
and White (1988) or Pötscher and Prucha (1997). It is then possible
to establish the consistency and asymptotic normality of the estimator
within this environment; see Gallant (1987), Gallant and White (1988)
and Pötscher and Prucha (1997).


• Deterministic trends: Andrews and McDermott (1995) consider the case in
which the data are generated by $v_t = u(d_t, w_t)$ where $d_t$ is a deterministic
trend and $w_t$ is a stationary process. They provide conditions under which
the GMM estimator is consistent and asymptotically normal within this
set-up.


• Unit root processes: If $v_t$ contains unit root processes then the limiting
distribution theory is non-standard. To date, progress has only been made
for the particular case of IV estimation of linear models. Hall (1987b) and
Pantula and Hall (1991) analyze the behaviour of unit root tests based
on IV estimators. Phillips and Hansen (1990) and Kitamura and Phillips
(1997) analyze the limiting behaviour of IV estimators in a multivariate
setting.


4 For example, consider the case in which $v_t$ is scalar and possesses a Student t distribution
with $\delta$ degrees of freedom. As the parameter $\delta$ decreases the tails of the distribution become
“thicker” and this has implications for the moments. Specifically, the moments of the t
distribution only exist up to order $\delta$. Therefore the restriction $E[\|v_t\|^{4\nu}] = E[v_t^{4\nu}] < \infty$
implies $\delta > 4\nu$, and thereby implicitly places a restriction on the thickness of the tails of the
distribution. See Johnson and Kotz (1970) [Chapter 27] for a discussion of the properties of
the t distribution.